Machine learning output sampling for a data intake and query system

ABSTRACT

Systems and methods are described for providing a user interface through which a user can program operation of a data processing pipeline by specifying a graph of nodes that transform data and interconnections that designate routing of data between individual nodes within the graph. In response to a user request, a preview mode can be activated that causes the data processing pipeline to retrieve data from at least one source specified by the graph, transform the data according to the nodes of the graph, sample the transformed data, and display the sampling of the transformed data to at least one node without writing the transformed data to at least one destination specified by the graph.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/779,486, entitled “SAMPLING-BASED PREVIEW MODE FOR A DATA INTAKE AND QUERY SYSTEM” and filed on Jan. 31, 2020, which claims the benefit of U.S. Provisional Patent Application No. 62/923,437, entitled “ANOMALY DETECTION IN DATA INGESTED TO A DATA INTAKE AND QUERY SYSTEM” and filed on Oct. 18, 2019, which are both incorporated by reference herein in their entireties. Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are incorporated by reference under 37 CFR 1.57 and made a part of this specification. This application also incorporates by reference herein the following U.S. App. Nos.: Ser. No. 16/148,840, filed Oct. 1, 2018; Ser. No. 16/148,703, filed Oct. 1, 2018; Ser. No. 16/148,736, filed Oct. 1, 2018; and Ser. No. 16/177,234, filed Oct. 31, 2018, in their entirety. In addition, the present application incorporates by reference herein in its entirety U.S. Provisional Patent Application No. 62/923,447, filed on Oct. 18, 2019.

This application also incorporates by reference herein in their entireties the following U.S. Applications:

U.S. application Ser. No. | Attorney Docket | Title | Filing Date
16/779,456 | SPLK.066A1 | ONLINE MACHINE LEARNING ALGORITHM FOR A DATA INTAKE AND QUERY SYSTEM | Jan. 31, 2020
16/779,460 | SPLK.066A2 | ANOMALY AND OUTLIER EXPLANATION GENERATION FOR DATA INGESTED TO A DATA INTAKE AND QUERY SYSTEM | Jan. 31, 2020
16/779,509 | SPLK.066A4 | SWAPPABLE ONLINE MACHINE LEARNING ALGORITHMS IMPLEMENTED IN A DATA INTAKE AND QUERY SYSTEM | Jan. 31, 2020

FIELD

At least one embodiment of the present disclosure pertains to one or more tools for facilitating searching and analyzing large sets of data to locate data of interest.

BACKGROUND

Information technology (IT) environments can include diverse types of data systems that store large amounts of diverse data types generated by numerous devices. For example, a big data ecosystem may include databases such as MySQL and Oracle databases, cloud computing services such as Amazon web services (AWS), and other data systems that store passively or actively generated data, including machine-generated data (“machine data”). The machine data can include performance data, diagnostic data, or any other data that can be analyzed to diagnose equipment performance problems, monitor user interactions, and to derive other insights.

The number and diversity of data systems containing large amounts of structured, semi-structured, and unstructured data relevant to any search query can be massive, and continue to grow rapidly. This technological evolution can give rise to various challenges in relation to managing, understanding and effectively utilizing the data. To reduce the potentially vast amount of data that may be generated, some data systems pre-process data based on anticipated data analysis needs. In particular, specified data items may be extracted from the generated data and stored in a data system to facilitate efficient retrieval and analysis of those data items at a later time. At least some of the remainder of the generated data is typically discarded during pre-processing.

However, storing massive quantities of minimally processed or unprocessed data (collectively and individually referred to as “raw data”) for later retrieval and analysis is becoming increasingly more feasible as storage capacity becomes more inexpensive and plentiful. In general, storing raw data and performing analysis on that data later can provide greater flexibility because it enables an analyst to analyze all of the generated data instead of only a fraction of it.

Although the availability of vastly greater amounts of diverse data on diverse data systems provides opportunities to derive new insights, it also gives rise to technical challenges to search and analyze the data. Tools exist that allow an analyst to search data systems separately and collect results over a network for the analyst to derive insights in a piecemeal manner. However, UI tools that allow analysts to quickly search and analyze large sets of raw machine data to visually identify data subsets of interest, particularly via straightforward and easy-to-understand sets of tools and search functionality, do not exist.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is illustrated by way of example, and not limitation, in the figures of the accompanying drawings, in which like reference numerals indicate similar elements.

FIG. 1 is a block diagram of an example networked computer environment, in accordance with example embodiments.

FIG. 2 is a block diagram of an example data intake and query system, in accordance with example embodiments.

FIG. 3A is a block diagram of one embodiment of an intake system.

FIG. 3B is a block diagram of another embodiment of an intake system.

FIG. 4 is a block diagram illustrating an embodiment of an indexing system of the data intake and query system.

FIG. 5 is a block diagram illustrating an embodiment of a query system of the data intake and query system.

FIG. 6 is a flow diagram depicting illustrative interactions for processing data through an intake system, in accordance with example embodiments.

FIG. 7 is a flowchart depicting an illustrative routine for processing data at an intake system, according to example embodiments.

FIG. 8 is a data flow diagram illustrating an embodiment of the data flow and communications between a variety of the components of the data intake and query system during indexing.

FIG. 9 is a flow diagram illustrative of an embodiment of a routine implemented by an indexing system to store data in common storage.

FIG. 10 is a flow diagram illustrative of an embodiment of a routine implemented by an indexing system to store data in common storage.

FIG. 11 is a flow diagram illustrative of an embodiment of a routine implemented by an indexing node to update a location marker in an ingestion buffer.

FIG. 12 is a flow diagram illustrative of an embodiment of a routine implemented by an indexing node to merge buckets.

FIG. 13 is a data flow diagram illustrating an embodiment of the data flow and communications between a variety of the components of the data intake and query system during execution of a query.

FIG. 14 is a flow diagram illustrative of an embodiment of a routine implemented by a query system to execute a query.

FIG. 15 is a flow diagram illustrative of an embodiment of a routine implemented by a query system to execute a query.

FIG. 16 is a flow diagram illustrative of an embodiment of a routine implemented by a query system to identify buckets for query execution.

FIG. 17 is a flow diagram illustrative of an embodiment of a routine implemented by a query system to identify search nodes for query execution.

FIG. 18 is a flow diagram illustrative of an embodiment of a routine implemented by a query system to hash bucket identifiers for query execution.

FIG. 19 is a flow diagram illustrative of an embodiment of a routine implemented by a search node to execute a search on a bucket.

FIG. 20 is a flow diagram illustrative of an embodiment of a routine implemented by the query system to store search results.

FIG. 21A is a flowchart of an example method that illustrates how indexers process, index, and store data received from an intake system, in accordance with example embodiments.

FIG. 21B is a block diagram of a data structure in which time-stamped event data can be stored in a data store, in accordance with example embodiments.

FIG. 21C provides a visual representation of the manner in which a pipelined search language or query operates, in accordance with example embodiments.

FIG. 22A is a flow diagram of an example method that illustrates how a search head and indexers perform a search query, in accordance with example embodiments.

FIG. 22B provides a visual representation of an example manner in which a pipelined command language or query operates, in accordance with example embodiments.

FIG. 23A is a diagram of an example scenario where a common customer identifier is found among log data received from three disparate data sources, in accordance with example embodiments.

FIG. 23B illustrates an example of processing keyword searches and field searches, in accordance with disclosed embodiments.

FIG. 23C illustrates an example of creating and using an inverted index, in accordance with example embodiments.

FIG. 23D depicts a flowchart of example use of an inverted index in a pipelined search query, in accordance with example embodiments.

FIG. 24A is an interface diagram of an example user interface for a search screen, in accordance with example embodiments.

FIG. 24B is an interface diagram of an example user interface for a data summary dialog that enables a user to select various data sources, in accordance with example embodiments.

FIGS. 25, 26, 27A-27D, 28, 29, 30, and 31 are interface diagrams of example report generation user interfaces, in accordance with example embodiments.

FIG. 32 is an example search query received from a client and executed by search peers, in accordance with example embodiments.

FIG. 33A is an interface diagram of an example user interface of a key indicators view, in accordance with example embodiments.

FIG. 33B is an interface diagram of an example user interface of an incident review dashboard, in accordance with example embodiments.

FIG. 33C is a tree diagram of an example proactive monitoring tree, in accordance with example embodiments.

FIG. 33D is an interface diagram of an example user interface displaying both log data and performance data, in accordance with example embodiments.

FIG. 34A is a block diagram of one embodiment of a streaming data processor.

FIG. 34B is a block diagram of one embodiment of distributed pattern matcher tasks.

FIG. 34C is a block diagram of one embodiment of distributed pipeline metric outlier detector tasks.

FIG. 35 illustrates an example anomaly and pattern workbook view rendered and displayed by the client browser in which the anomaly and pattern workbook view depicts various information about anomalies detected by the anomaly detector of the streaming data processor.

FIG. 36 illustrates an example anomaly and pattern workbook view rendered and displayed by the client browser in which the user has elected to expand carrot to show the specific anomalous events corresponding to the first row in the list.

FIG. 37 illustrates an example anomaly and pattern workbook view rendered and displayed by the client browser in which the user has elected to view events surrounding a particular anomalous event.

FIG. 38 illustrates an example anomaly and pattern workbook view rendered and displayed by the client browser in which the user has hidden the anomalous event information and expanded the normal event information.

FIG. 39 illustrates an example pattern catalog view rendered and displayed by the client browser in which events that match or are otherwise assigned to a certain data pattern are displayed.

FIG. 40 illustrates another example pattern catalog view rendered and displayed by the client browser in which trends in event occurrences and/or event anomaly detections are displayed.

FIG. 41 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to detect an anomalous log.

FIG. 42 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to determine whether a comparable data structure should be assigned to a data pattern.

FIG. 43 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to assign a comparable data structure to a data pattern in real-time.

FIG. 44 is another flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to assign a comparable data structure to a data pattern in real-time.

FIG. 45 is another flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to merge data patterns in real-time.

FIG. 46 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to detect an anomalous pipeline metric.

FIG. 47 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to detect an anomalous metric.

FIG. 48 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to assign a set of metrics to a metric cluster in real-time.

FIG. 49 is another flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to assign a set of metrics to a metric cluster in real-time.

FIG. 50 is another flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to merge metric clusters in real-time.

FIG. 51 illustrates another example anomaly and pattern workbook view rendered and displayed by the client browser in which the anomaly and pattern workbook view depicts various information about anomalies detected by the anomaly detector.

FIGS. 52A-52B illustrate other example anomaly and pattern workbook views rendered and displayed by the client browser in which the anomaly and pattern workbook views depict various information about anomalies detected by the anomaly detector.

FIGS. 53A-53B illustrate other example anomaly and pattern workbook views rendered and displayed by the client browser in which the anomaly and pattern workbook views depict various information about anomalies detected by the anomaly detector.

FIGS. 54A-54B illustrate other example anomaly and pattern workbook views rendered and displayed by the client browser in which the anomaly and pattern workbook views depict various information about anomalies detected by the anomaly detector.

FIGS. 55A-55B illustrate other example anomaly and pattern workbook views rendered and displayed by the client browser in which the anomaly and pattern workbook views depict various information about anomalies detected by the anomaly detector 3406 during the time range corresponding to the bucket.

FIGS. 56-58 illustrate other example anomaly and pattern workbook views rendered and displayed by the client browser in which the anomaly and pattern workbook views depict more detailed information about anomalies detected by the anomaly detector.

FIG. 59 illustrates an example anomaly and pattern workbook view rendered and displayed by the client browser in which the user has elected to view events surrounding a particular anomalous event.

FIG. 60 is another block diagram of one embodiment of a streaming data processor.

FIG. 61 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to implement an online machine learning model.

FIG. 62 illustrates a graph depicting various values generated over time.

FIG. 63 illustrates a data processing pipeline that includes an adaptive thresholder.

FIG. 64 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to perform adaptive thresholding.

FIG. 65 illustrates a data processing pipeline that includes a sequential outlier detector.

FIG. 66 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to perform sequential outlier detection.

FIG. 67 is another flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to perform sequential outlier detection.

FIG. 68 illustrates a data processing pipeline that includes a sentiment analyzer.

FIG. 69 illustrates an example block diagram of the sentiment analyzer depicting operations that are performed when raw machine data includes both text and a rating or label.

FIG. 70 illustrates an example block diagram of the sentiment analyzer depicting operations that are performed when raw machine data includes the text, but no rating or label.

FIG. 71 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to perform sentiment analysis.

FIG. 72 illustrates a graph showing time-series data values.

FIG. 73 illustrates a data processing pipeline that includes a drift detector.

FIG. 74 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to perform drift detection in time-series data.

FIG. 75 illustrates a data processing pipeline that includes an anomaly explainer.

FIG. 76 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to explain anomalies.

FIG. 77 is a block diagram of one embodiment of a graphical programming system that provides a graphical interface for designing data processing pipelines, in accordance with example embodiments.

FIG. 78 is an interface diagram of an example user interface for previewing a data processing pipeline being designed in the user interface, in accordance with example embodiments.

FIG. 79A is a block diagram of a graph representing a data processing pipeline, in accordance with example embodiments.

FIG. 79B is a block diagram of the graph of FIG. 79A having added nodes to facilitate the disclosed data processing pipeline previews, in accordance with example embodiments.

FIG. 80 is a flow diagram depicting illustrative interactions for generating data processing pipeline previews, in accordance with example embodiments.

FIG. 81 depicts an illustrative algorithm or routine implemented by the graphical programming system to generate data processing pipeline previews.

FIG. 82 is a block diagram of a graph representing a data processing pipeline, in accordance with example embodiments.

FIG. 83 is another block diagram of a graph representing the data processing pipeline of FIG. 82, in accordance with example embodiments.

FIG. 84 is a flow diagram illustrative of an embodiment of a routine implemented by the streaming data processor to test and swap machine learning algorithms.

DETAILED DESCRIPTION

Embodiments are described herein according to the following outline:

-   1.0. General Overview
-   2.0. Operating Environment
    -   2.1. Host Devices
    -   2.2. Client Devices
    -   2.3. Client Device Applications
    -   2.4. Data Intake and Query System Overview
-   3.0. Data Intake and Query System Architecture
    -   3.1. Intake System
        -   3.1.1 Forwarder
        -   3.1.2 Data Retrieval Subsystem
        -   3.1.3 Ingestion Buffer
        -   3.1.4 Streaming Data Processors
    -   3.2. Indexing System
        -   3.2.1. Indexing System Manager
        -   3.2.2. Indexing Nodes
            -   3.2.2.1 Indexing Node Manager
            -   3.2.2.2 Partition Manager
            -   3.2.2.3 Indexer and Data Store
        -   3.2.3. Bucket Manager
    -   3.3 Query System
        -   3.3.1. Query System Manager
        -   3.3.2. Search Head
            -   3.3.2.1 Search Master
            -   3.3.2.2 Search Manager
        -   3.3.3. Search Nodes
        -   3.3.4. Cache Manager
        -   3.3.5. Search Node Monitor and Catalog
    -   3.4. Common Storage
    -   3.5. Data Store Catalog
    -   3.6. Query Acceleration Data Store
-   4.0. Data Intake and Query System Functions
    -   4.1. Ingestion
        -   4.1.1 Publication to Intake Topic(s)
        -   4.1.2 Transmission to Streaming Data Processors
        -   4.1.3 Messages Processing
        -   4.1.4 Transmission to Subscribers
        -   4.1.5 Data Resiliency and Security
        -   4.1.6 Message Processing Algorithm
    -   4.2. Indexing
        -   4.2.1. Containerized Indexing Nodes
        -   4.2.2. Moving Buckets to Common Storage
        -   4.2.3. Updating Location Marker in Ingestion Buffer
        -   4.2.4. Merging Buckets
    -   4.3. Querying
        -   4.3.1. Containerized Search Nodes
        -   4.3.2. Identifying Buckets for Query Execution
        -   4.3.4. Hashing Bucket Identifiers for Query Execution
        -   4.3.5. Mapping Buckets to Search Nodes
        -   4.3.6. Obtaining Data for Query Execution
        -   4.3.7. Caching Search Results
    -   4.4. Data Ingestion, Indexing, and Storage Flow
        -   4.4.1. Input
        -   4.4.2. Parsing
        -   4.4.3. Indexing
    -   4.5. Query Processing Flow
    -   4.6. Pipelined Search Language
    -   4.7. Field Extraction
    -   4.8. Example Search Screen
    -   4.9. Data Models
    -   4.10. Acceleration Techniques
        -   4.10.1. Aggregation Technique
        -   4.10.2. Keyword Index
        -   4.10.3. High Performance Analytics Store
            -   4.10.3.1 Extracting Event Data Using Posting
        -   4.10.4. Accelerating Report Generation
    -   4.12. Security Features
    -   4.13. Data Center Monitoring
    -   4.14. IT Service Monitoring
    -   4.15. Anomaly Detection
        -   4.15.1. Anomaly Detection Architecture
            -   4.15.1.1. Pattern Matching Distributed Architecture
            -   4.15.1.2. Anomaly Detection in Logs
            -   4.15.1.3. Outlier Detection Distributed Architecture
        -   4.15.2. Data Pattern and Anomaly User Interfaces
        -   4.15.3. Anomalous Log Detection Routines
        -   4.15.4. Anomalous Pipeline Metric Detection Routines
    -   4.16. Online Machine Learning
        -   4.16.1. Adaptive Thresholding
        -   4.16.2. Sequential Outlier Detection
        -   4.16.3. Sentiment Analysis
        -   4.16.4. Drift Detection
        -   4.16.5. Explainability
        -   4.16.6. Preview Mode
        -   4.16.7. A/B Testing and Algorithm Swapping
    -   4.17. Other Architectures
-   5.0. Terminology
-   6.0. Example Embodiments

1.0. General Overview

Modern data centers and other computing environments can comprise anywhere from a few host computer systems to thousands of systems configured to process data, service requests from remote clients, and perform numerous other computational tasks. During operation, various components within these computing environments often generate significant volumes of machine data. Machine data is any data produced by a machine or component in an information technology (IT) environment and that reflects activity in the IT environment. For example, machine data can be raw machine data that is generated by various components in IT environments, such as servers, sensors, routers, mobile devices, Internet of Things (IoT) devices, etc. Machine data can include system logs, network packet data, sensor data, application program data, error logs, stack traces, system performance data, etc. In general, machine data can also include performance data, diagnostic information, and many other types of data that can be analyzed to diagnose performance problems, monitor user interactions, and to derive other insights.

A number of tools are available to analyze machine data. In order to reduce the size of the potentially vast amount of machine data that may be generated, many of these tools typically pre-process the data based on anticipated data-analysis needs. For example, pre-specified data items may be extracted from the machine data and stored in a database to facilitate efficient retrieval and analysis of those data items at search time. However, the rest of the machine data typically is not saved and is discarded during pre-processing. As storage capacity becomes progressively cheaper and more plentiful, there are fewer incentives to discard these portions of machine data and many reasons to retain more of the data.

This plentiful storage capacity is presently making it feasible to store massive quantities of minimally processed machine data for later retrieval and analysis. In general, storing minimally processed machine data and performing analysis operations at search time can provide greater flexibility because it enables an analyst to search all of the machine data, instead of searching only a pre-specified set of data items. This may enable an analyst to investigate different aspects of the machine data that previously were unavailable for analysis.

However, analyzing and searching massive quantities of machine data presents a number of challenges. For example, a data center, servers, or network appliances may generate many different types and formats of machine data (e.g., system logs, network packet data (e.g., wire data, etc.), sensor data, application program data, error logs, stack traces, system performance data, operating system data, virtualization data, etc.) from thousands of different components, which can collectively be very time-consuming to analyze. In another example, mobile devices may generate large amounts of information relating to data accesses, application performance, operating system performance, network performance, etc. There can be millions of mobile devices that report these types of information.

These challenges can be addressed by using an event-based data intake and query system, such as the SPLUNK® ENTERPRISE system developed by Splunk Inc. of San Francisco, Calif. The SPLUNK® ENTERPRISE system is the leading platform for providing real-time operational intelligence that enables organizations to collect, index, and search machine data from various websites, applications, servers, networks, and mobile devices that power their businesses. The data intake and query system is particularly useful for analyzing data which is commonly found in system log files, network data, and other data input sources. Although many of the techniques described herein are explained with reference to a data intake and query system similar to the SPLUNK® ENTERPRISE system, these techniques are also applicable to other types of data systems.

In the data intake and query system, machine data are collected and stored as “events”. An event comprises a portion of machine data and is associated with a specific point in time. The portion of machine data may reflect activity in an IT environment and may be produced by a component of that IT environment, where the events may be searched to provide insight into the IT environment, thereby improving the performance of components in the IT environment. Events may be derived from “time series data,” where the time series data comprises a sequence of data points (e.g., performance measurements from a computer system, etc.) that are associated with successive points in time. In general, each event has a portion of machine data that is associated with a timestamp that is derived from the portion of machine data in the event. A timestamp of an event may be determined through interpolation between temporally proximate events having known timestamps or may be determined based on other configurable rules for associating timestamps with events.
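As a rough illustration of the interpolation mentioned above, the following Python sketch fills in a missing timestamp from the nearest preceding and following events whose timestamps are known. The `Event` class, the `interpolate_timestamps` function, and the position-based linear rule are all assumptions made for illustration; the disclosure does not prescribe a particular interpolation method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    raw: str                    # portion of machine data
    timestamp: Optional[float]  # epoch seconds, or None when unknown


def interpolate_timestamps(events: List[Event]) -> List[Event]:
    """Fill unknown timestamps by linear interpolation over event position.

    Hypothetical rule: an event lying between two events with known
    timestamps is assigned a proportionally spaced timestamp. Events
    without a known neighbor on both sides are left unchanged.
    """
    known = [i for i, e in enumerate(events) if e.timestamp is not None]
    result = [Event(e.raw, e.timestamp) for e in events]
    for left, right in zip(known, known[1:]):
        t0 = events[left].timestamp
        t1 = events[right].timestamp
        span = right - left
        for i in range(left + 1, right):
            result[i].timestamp = t0 + (t1 - t0) * (i - left) / span
    return result


events = [
    Event("cpu=41 mem=70", 100.0),
    Event("cpu=43 mem=71", None),
    Event("cpu=47 mem=73", None),
    Event("cpu=52 mem=75", 130.0),
]
filled = interpolate_timestamps(events)
```

Here the two middle events receive timestamps evenly spaced between the known values of their neighbors. Other configurable rules (for example, reusing the previous event's timestamp) could be substituted without changing the surrounding logic.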

In some instances, machine data can have a predefined format, where data items with specific data formats are stored at predefined locations in the data. For example, the machine data may include data associated with fields in a database table. In other instances, machine data may not have a predefined format (e.g., may not be at fixed, predefined locations), but may have repeatable (e.g., non-random) patterns. This means that some machine data can comprise various data items of different data types that may be stored at different locations within the data. For example, when the data source is an operating system log, an event can include one or more lines from the operating system log containing machine data that includes different types of performance and diagnostic information associated with a specific point in time (e.g., a timestamp).

Examples of components which may generate machine data from which events can be derived include, but are not limited to, web servers, application servers, databases, firewalls, routers, operating systems, and software applications that execute on computer systems, mobile devices, sensors, Internet of Things (IoT) devices, etc. The machine data generated by such data sources can include, for example and without limitation, server log files, activity log files, configuration files, messages, network packet data, performance measurements, sensor measurements, etc.

The data intake and query system uses a flexible schema to specify how to extract information from events. A flexible schema may be developed and redefined as needed. Note that a flexible schema may be applied to events “on the fly,” when it is needed (e.g., at search time, index time, ingestion time, etc.). When the schema is not applied to events until search time, the schema may be referred to as a “late-binding schema.”

During operation, the data intake and query system receives machine data from any type and number of sources (e.g., one or more system logs, streams of network packet data, sensor data, application program data, error logs, stack traces, system performance data, etc.). The system parses the machine data to produce events each having a portion of machine data associated with a timestamp. The system stores the events in a data store. The system enables users to run queries against the stored events to, for example, retrieve events that meet criteria specified in a query, such as criteria indicating certain keywords or having specific values in defined fields. As used herein, the term “field” refers to a location in the machine data of an event containing one or more values for a specific data item. A field may be referenced by a field name associated with the field. As will be described in more detail herein, a field is defined by an extraction rule (e.g., a regular expression) that derives one or more values or a sub-portion of text from the portion of machine data in each event to produce a value for the field for that event. The set of values produced are semantically-related (such as IP address), even though the machine data in each event may be in different formats (e.g., semantically-related values may be in different positions in the events derived from different sources).

As described above, the system stores the events in a data store. The events stored in the data store are field-searchable, where field-searchable herein refers to the ability to search the machine data (e.g., the raw machine data) of an event based on a field specified in search criteria. For example, a search having criteria that specifies a field name “UserID” may cause the system to field-search the machine data of events to identify events that have the field name “UserID.” In another example, a search having criteria that specifies a field name “UserID” with a corresponding field value “12345” may cause the system to field-search the machine data of events to identify events having that field-value pair (e.g., field name “UserID” with a corresponding field value of “12345”). Events are field-searchable using one or more configuration files associated with the events. Each configuration file includes one or more field names, where each field name is associated with a corresponding extraction rule and a set of events to which that extraction rule applies. The set of events to which an extraction rule applies may be identified by metadata associated with the set of events. For example, an extraction rule may apply to a set of events that are each associated with a particular host, source, or source type. When events are to be searched based on a particular field name specified in a search, the system uses one or more configuration files to determine whether there is an extraction rule for that particular field name that applies to each event that falls within the criteria of the search. If so, the event is considered as part of the search results (and additional processing may be performed on that event based on criteria specified in the search). If not, the next event is similarly analyzed, and so on.
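The field-search procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system's actual implementation: the `CONFIG` list stands in for the contents of a configuration file, and the `ExtractionRule` class, the `extract` and `field_search` functions, the source type names, and the sample events are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    raw: str         # raw machine data of the event
    host: str        # metadata associated with the event
    sourcetype: str  # metadata associated with the event


@dataclass
class ExtractionRule:
    field_name: str
    pattern: str     # regex with one capture group holding the field value
    sourcetype: str  # metadata selecting the events the rule applies to


# Hypothetical stand-in for the contents of a configuration file.
CONFIG: List[ExtractionRule] = [
    ExtractionRule("UserID", r"UserID=(\d+)", "app_log"),
    ExtractionRule("UserID", r'"user_id":\s*"(\d+)"', "json_log"),
]


def extract(event: Event, field_name: str) -> Optional[str]:
    """Return the field value if some rule for this field applies to the event."""
    for rule in CONFIG:
        if rule.field_name == field_name and rule.sourcetype == event.sourcetype:
            match = re.search(rule.pattern, event.raw)
            if match:
                return match.group(1)
    return None


def field_search(events: List[Event], field_name: str,
                 value: Optional[str] = None) -> List[Event]:
    """Keep events that have the field and, if a value is given, that value."""
    results = []
    for event in events:
        extracted = extract(event, field_name)
        if extracted is None:
            continue  # no applicable rule matched; analyze the next event
        if value is None or extracted == value:
            results.append(event)
    return results


events = [
    Event("2020-01-31 login UserID=12345 ok", "web01", "app_log"),
    Event('{"user_id": "67890", "action": "view"}', "web02", "json_log"),
    Event("kernel: eth0 link up", "web01", "syslog"),
]
has_user_id = field_search(events, "UserID")
user_12345 = field_search(events, "UserID", "12345")
```

In this sketch, a search on the field name alone returns the first two events, while a search on the field-value pair returns only the first. The third event has no applicable rule and is skipped.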

As noted above, the data intake and query system utilizes a late-binding schema while performing queries on events. One aspect of a late-binding schema is applying extraction rules to events to extract values for specific fields during search time. More specifically, the extraction rule for a field can include one or more instructions that specify how to extract a value for the field from an event. An extraction rule can generally include any type of instruction for extracting values from events. In some cases, an extraction rule comprises a regular expression, where a sequence of characters form a search pattern. An extraction rule comprising a regular expression is referred to herein as a regex rule. The system applies a regex rule to an event to extract values for a field associated with the regex rule, where the values are extracted by searching the event for the sequence of characters defined in the regex rule.

In the data intake and query system, a field extractor may be configured to automatically generate extraction rules for certain fields in the events when the events are being created, indexed, or stored, or possibly at a later time. Alternatively, a user may manually define extraction rules for fields using a variety of techniques. In contrast to a conventional schema for a database system, a late-binding schema is not defined at data ingestion time. Instead, the late-binding schema can be developed on an ongoing basis until the time a query is actually executed. This means that extraction rules for the fields specified in a query may be provided in the query itself, or may be located during execution of the query. Hence, as a user learns more about the data in the events, the user can continue to refine the late-binding schema by adding new fields, deleting fields, or modifying the field extraction rules for use the next time the schema is used by the system. Because the data intake and query system maintains the underlying machine data and uses a late-binding schema for searching the machine data, it enables a user to continue investigating and learn valuable insights about the machine data.
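As a minimal sketch of the late-binding behavior described above, the following example keeps the raw machine data unchanged and applies regex rules only when a query runs, so that a field added after ingestion is immediately usable. The event text, field names, and helper function are assumptions made for illustration.

```python
import re

# Raw machine data is stored as-is; no schema is imposed at ingestion time.
raw_events = [
    "ts=1571400000 action=login user=alice latency_ms=120",
    "ts=1571400005 action=logout user=bob latency_ms=45",
]

# Hypothetical late-binding schema: field name -> regex rule, consulted
# only at search time.
schema = {"user": re.compile(r"user=(\w+)")}


def run_query(events, schema, field_name):
    """Extract values for one field from raw events at query time."""
    rule = schema[field_name]
    values = []
    for event in events:
        match = rule.search(event)
        if match:
            values.append(match.group(1))
    return values


users = run_query(raw_events, schema, "user")

# Later, the schema is refined by adding a field. The stored events are not
# re-ingested or modified; the new rule simply applies to the next query.
schema["latency_ms"] = re.compile(r"latency_ms=(\d+)")
latencies = run_query(raw_events, schema, "latency_ms")
```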

In some embodiments, a common field name may be used to reference two or more fields containing equivalent and/or similar data items, even though the fields may be associated with different types of events that possibly have different data formats and different extraction rules. By enabling a common field name to be used to identify equivalent and/or similar fields from different types of events generated by disparate data sources, the system facilitates use of a “common information model” (CIM) across the disparate data sources (further discussed with respect to FIG. 23A).

2.0. Operating Environment

FIG. 1 is a block diagram of an example networked computer environment 100, in accordance with example embodiments. It will be understood that FIG. 1 represents one example of a networked computer system and other embodiments may use different arrangements.

The networked computer system 100 comprises one or more computing devices. These one or more computing devices comprise any combination of hardware and software configured to implement the various logical components described herein. For example, the one or more computing devices may include one or more memories that store instructions for implementing the various components described herein, one or more hardware processors configured to execute the instructions stored in the one or more memories, and various data repositories in the one or more memories for storing data structures utilized and manipulated by the various components.

In some embodiments, one or more client devices 102 are coupled to one or more host devices 106 and a data intake and query system 108 via one or more networks 104. Networks 104 broadly represent one or more LANs, WANs, cellular networks (e.g., LTE, HSPA, 3G, and other cellular technologies), and/or networks using any of wired, wireless, terrestrial microwave, or satellite links, and may include the public Internet.

2.1. Host Devices

In the illustrated embodiment, a system 100 includes one or more host devices 106. Host devices 106 may broadly include any number of computers, virtual machine instances, and/or data centers that are configured to host or execute one or more instances of host applications 114. In general, a host device 106 may be involved, directly or indirectly, in processing requests received from client devices 102. Each host device 106 may comprise, for example, one or more of a network device, a web server, an application server, a database server, etc. A collection of host devices 106 may be configured to implement a network-based service. For example, a provider of a network-based service may configure one or more host devices 106 and host applications 114 (e.g., one or more web servers, application servers, database servers, etc.) to collectively implement the network-based application.

In general, client devices 102 communicate with one or more host applications 114 to exchange information. The communication between a client device 102 and a host application 114 may, for example, be based on the Hypertext Transfer Protocol (HTTP) or any other network protocol. Content delivered from the host application 114 to a client device 102 may include, for example, HTML documents, media content, etc. The communication between a client device 102 and host application 114 may include sending various requests and receiving data packets. For example, in general, a client device 102 or application running on a client device may initiate communication with a host application 114 by making a request for a specific resource (e.g., based on an HTTP request), and the application server may respond with the requested content stored in one or more response packets.

In the illustrated embodiment, one or more of host applications 114 may generate various types of performance data during operation, including event logs, network data, sensor data, and other types of machine data. For example, a host application 114 comprising a web server may generate one or more web server logs in which details of interactions between the web server and any number of client devices 102 are recorded. As another example, a host device 106 comprising a router may generate one or more router logs that record information related to network traffic managed by the router. As yet another example, a host application 114 comprising a database server may generate one or more logs that record information related to requests sent from other host applications 114 (e.g., web servers or application servers) for data managed by the database server.

2.2. Client Devices

Client devices 102 of FIG. 1 represent any computing device capable of interacting with one or more host devices 106 via a network 104. Examples of client devices 102 may include, without limitation, smartphones, tablet computers, handheld computers, wearable devices, laptop computers, desktop computers, servers, portable media players, gaming devices, and so forth. In general, a client device 102 can provide access to different content, for instance, content provided by one or more host devices 106, etc. Each client device 102 may comprise one or more client applications 110, described in more detail in a separate section hereinafter.

2.3. Client Device Applications

In some embodiments, each client device 102 may host or execute one or more client applications 110 that are capable of interacting with one or more host devices 106 via one or more networks 104. For instance, a client application 110 may be or comprise a web browser that a user may use to navigate to one or more websites or other resources provided by one or more host devices 106. As another example, a client application 110 may comprise a mobile application or “app.” For example, an operator of a network-based service hosted by one or more host devices 106 may make available one or more mobile apps that enable users of client devices 102 to access various resources of the network-based service. As yet another example, client applications 110 may include background processes that perform various operations without direct interaction from a user. A client application 110 may include a “plug-in” or “extension” to another application, such as a web browser plug-in or extension.

In some embodiments, a client application 110 may include a monitoring component 112. At a high level, the monitoring component 112 comprises a software component or other logic that facilitates generating performance data related to a client device's operating state, including monitoring network traffic sent and received from the client device and collecting other device and/or application-specific information. Monitoring component 112 may be an integrated component of a client application 110, a plug-in, an extension, or any other type of add-on component. Monitoring component 112 may also be a stand-alone process.

In some embodiments, a monitoring component 112 may be created when a client application 110 is developed, for example, by an application developer using a software development kit (SDK). The SDK may include custom monitoring code that can be incorporated into the code implementing a client application 110. When the code is converted to an executable application, the custom code implementing the monitoring functionality can become part of the application itself.

In some embodiments, an SDK or other code for implementing the monitoring functionality may be offered by a provider of a data intake and query system, such as a system 108. In such cases, the provider of the system 108 can implement the custom code so that performance data generated by the monitoring functionality is sent to the system 108 to facilitate analysis of the performance data by a developer of the client application or other users.

In some embodiments, the custom monitoring code may be incorporated into the code of a client application 110 in a number of different ways, such as the insertion of one or more lines in the client application code that call or otherwise invoke the monitoring component 112. As such, a developer of a client application 110 can add one or more lines of code into the client application 110 to trigger the monitoring component 112 at desired points during execution of the application. Code that triggers the monitoring component may be referred to as a monitor trigger. For instance, a monitor trigger may be included at or near the beginning of the executable code of the client application 110 such that the monitoring component 112 is initiated or triggered as the application is launched, or included at other points in the code that correspond to various actions of the client application, such as sending a network request or displaying a particular interface.

In some embodiments, the monitoring component 112 may monitor one or more aspects of network traffic sent and/or received by a client application 110. For example, the monitoring component 112 may be configured to monitor data packets transmitted to and/or from one or more host applications 114. Incoming and/or outgoing data packets can be read or examined to identify network data contained within the packets, for example, and other aspects of data packets can be analyzed to determine a number of network performance statistics. Monitoring network traffic may enable information to be gathered particular to the network performance associated with a client application 110 or set of applications.

In some embodiments, network performance data refers to any type of data that indicates information about the network and/or network performance. Network performance data may include, for instance, a URL requested, a connection type (e.g., HTTP, HTTPS, etc.), a connection start time, a connection end time, an HTTP status code, request length, response length, request headers, response headers, connection status (e.g., completion, response time(s), failure, etc.), and the like. Upon obtaining network performance data indicating performance of the network, the network performance data can be transmitted to a data intake and query system 108 for analysis.

Upon developing a client application 110 that incorporates a monitoring component 112, the client application 110 can be distributed to client devices 102. Applications generally can be distributed to client devices 102 in any manner, or they can be pre-loaded. In some cases, the application may be distributed to a client device 102 via an application marketplace or other application distribution system. For instance, an application marketplace or other application distribution system might distribute the application to a client device based on a request from the client device to download the application.

Examples of functionality that enables monitoring performance of a client device are described in U.S. patent application Ser. No. 14/524,748, entitled “UTILIZING PACKET HEADERS TO MONITOR NETWORK TRAFFIC IN ASSOCIATION WITH A CLIENT DEVICE”, filed on 27 Oct. 2014, and which is hereby incorporated by reference in its entirety for all purposes.

In some embodiments, the monitoring component 112 may also monitor and collect performance data related to one or more aspects of the operational state of a client application 110 and/or client device 102. For example, a monitoring component 112 may be configured to collect device performance information by monitoring one or more client device operations, or by making calls to an operating system and/or one or more other applications executing on a client device 102 for performance information. Device performance information may include, for instance, a current wireless signal strength of the device, a current connection type and network carrier, current memory performance information, a geographic location of the device, a device orientation, and any other information related to the operational state of the client device.

In some embodiments, the monitoring component 112 may also monitor and collect other device profile information including, for example, a type of client device, a manufacturer, and model of the device, versions of various software applications installed on the device, and so forth.

In general, a monitoring component 112 may be configured to generate performance data in response to a monitor trigger in the code of a client application 110 or other triggering application event, as described above, and to store the performance data in one or more data records. Each data record, for example, may include a collection of field-value pairs, each field-value pair storing a particular item of performance data in association with a field for the item. For example, a data record generated by a monitoring component 112 may include a “networkLatency” field (not shown in the Figure) in which a value is stored. This field indicates a network latency measurement associated with one or more network requests. The data record may include a “state” field to store a value indicating a state of a network connection, and so forth for any number of aspects of collected performance data.
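One possible shape for such a data record is sketched below. Only the “networkLatency” and “state” field names come from the description above; the class names, the `on_trigger` method, and the `collectedAt` field are hypothetical and are included only to make the sketch self-contained.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """A collection of field-value pairs describing collected performance data."""
    fields: dict = field(default_factory=dict)


class MonitoringComponent:
    """Illustrative stand-in for a monitoring component invoked by a monitor trigger."""

    def __init__(self):
        self.records = []

    def on_trigger(self, network_latency_ms, state):
        # Store each item of performance data in association with its field.
        record = DataRecord({
            "networkLatency": network_latency_ms,
            "state": state,
            "collectedAt": time.time(),  # assumed extra field, not from the source
        })
        self.records.append(record)
        return record


monitor = MonitoringComponent()
# A monitor trigger placed, for example, after a network request completes.
record = monitor.on_trigger(network_latency_ms=87, state="connected")
```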

2.4. Data Intake and Query System Overview

The data intake and query system 108 can process and store data received from data sources, such as the client devices 102 or host devices 106, and execute queries on the data in response to requests received from one or more computing devices. In some cases, the data intake and query system 108 can generate events from the received data and store the events in buckets in a common storage system. In response to received queries, the data intake and query system can assign one or more search nodes to search the buckets in the common storage.

In certain embodiments, the data intake and query system 108 can include various components that enable it to provide stateless services or enable it to recover from an unavailable or unresponsive component without data loss in a time efficient manner. For example, the data intake and query system 108 can store contextual information about its various components in a distributed way such that if one of the components becomes unresponsive or unavailable, the data intake and query system 108 can replace the unavailable component with a different component and provide the replacement component with the contextual information. In this way, the data intake and query system 108 can quickly recover from an unresponsive or unavailable component while reducing or eliminating the loss of data that was being processed by the unavailable component.

3.0. Data Intake and Query System Architecture

FIG. 2 is a block diagram of an embodiment of a data processing environment 200. In the illustrated embodiment, the environment 200 includes data sources 202 and client devices 204 a, 204 b, 204 c (generically referred to as client device(s) 204) in communication with a data intake and query system 108 via networks 206, 208, respectively. The networks 206, 208 may be the same network, may correspond to the network 104, or may be different networks. Further, the networks 206, 208 may be implemented as one or more LANs, WANs, cellular networks, intranetworks, and/or internetworks using any of wired, wireless, terrestrial microwave, satellite links, etc., and may include the Internet.

Each data source 202 broadly represents a distinct source of data that can be consumed by the data intake and query system 108. Examples of data sources 202 include, without limitation, data files, directories of files, data sent over a network, event logs, registries, streaming data services (examples of which can include, by way of non-limiting example, Amazon's Simple Queue Service (“SQS”) or Kinesis™ services, devices executing Apache Kafka™ software, or devices implementing the Message Queue Telemetry Transport (MQTT) protocol, Microsoft Azure EventHub, Google Cloud PubSub, devices implementing the Java Message Service (JMS) protocol, devices implementing the Advanced Message Queuing Protocol (AMQP)), performance metrics, etc.

The client devices 204 can be implemented using one or more computing devices in communication with the data intake and query system 108, and represent some of the different ways in which computing devices can submit queries to the data intake and query system 108. For example, the client device 204 a is illustrated as communicating over an Internet (Web) protocol with the data intake and query system 108, the client device 204 b is illustrated as communicating with the data intake and query system 108 via a command line interface, and the client device 204 c is illustrated as communicating with the data intake and query system 108 via a software developer kit (SDK). However, it will be understood that the client devices 204 can communicate with and submit queries to the data intake and query system 108 in a variety of ways.

The data intake and query system 108 can process and store data received from the data sources 202 and execute queries on the data in response to requests received from the client devices 204. In the illustrated embodiment, the data intake and query system 108 includes an intake system 210, an indexing system 212, a query system 214, common storage 216 including one or more data stores 218, a data store catalog 220, and a query acceleration data store 222.

As mentioned, the data intake and query system 108 can receive data from different sources 202. In some cases, the data sources 202 can be associated with different tenants or customers. Further, each tenant may be associated with one or more indexes, hosts, sources, sourcetypes, or users. For example, company ABC, Inc. can correspond to one tenant and company XYZ, Inc. can correspond to a different tenant. While the two companies may be unrelated, each company may have a main index and test index associated with it, as well as one or more data sources or systems (e.g., billing system, CRM system, etc.). The data intake and query system 108 can concurrently receive and process the data from the various systems and sources of ABC, Inc. and XYZ, Inc.

In certain cases, although the data from different tenants can be processed together or concurrently, the data intake and query system 108 can take steps to avoid combining or co-mingling data from the different tenants. For example, the data intake and query system 108 can assign a tenant identifier for each tenant and maintain a separation between the data using the tenant identifier. In some cases, the tenant identifier can be assigned to the data at the data sources 202, or can be assigned to the data by the data intake and query system 108 at ingest.

As will be described in greater detail herein, at least with reference to FIGS. 3A and 3B, the intake system 210 can receive data from the data sources 202, perform one or more preliminary processing operations on the data, and communicate the data to the indexing system 212, query system 214, or to other systems 262 (which may include, for example, data processing systems, telemetry systems, real-time analytics systems, data stores, databases, etc., any of which may be operated by an operator of the data intake and query system 108 or a third party). The intake system 210 can receive data from the data sources 202 in a variety of formats or structures. In some embodiments, the received data corresponds to raw machine data, structured or unstructured data, correlation data, data files, directories of files, data sent over a network, event logs, registries, messages published to streaming data sources, performance metrics, sensor data, image and video data, etc. The intake system 210 can process the data based on the form in which it is received. In some cases, the intake system 210 can utilize one or more rules to process data and to make the data available to downstream systems (e.g., the indexing system 212, query system 214, etc.). Illustratively, the intake system 210 can enrich the received data. For example, the intake system may add one or more fields to the data received from the data sources 202, such as fields denoting the host, source, sourcetype, index, or tenant associated with the incoming data. In certain embodiments, the intake system 210 can perform additional processing on the incoming data, such as transforming structured data into unstructured data (or vice versa), identifying timestamps associated with the data, removing extraneous data, parsing data, indexing data, separating data, categorizing data, routing data based on criteria relating to the data being routed, and/or performing other data transformations, etc.
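The enrichment and criteria-based routing described above may be sketched, under stated assumptions, as follows. The function names, the dictionary record layout, and the destination labels are hypothetical; the sketch shows only that metadata fields can be added to incoming data and that a routing decision can then be made from those fields.

```python
def enrich(raw, *, host, source, sourcetype, index, tenant):
    """Return the incoming data together with added metadata fields."""
    return {
        "raw": raw,
        "host": host,
        "source": source,
        "sourcetype": sourcetype,
        "index": index,
        "tenant": tenant,
    }


def route(record, rules, default="indexing_system"):
    """Pick a destination using the first rule whose criteria the record meets."""
    for predicate, destination in rules:
        if predicate(record):
            return destination
    return default


# Hypothetical routing rule: metrics data goes to a real-time analytics system.
rules = [
    (lambda r: r["sourcetype"] == "metrics", "realtime_analytics"),
]

log_record = enrich(
    "ERROR disk full", host="web01", source="/var/log/app.log",
    sourcetype="app_log", index="main", tenant="ABC",
)
metric_record = enrich(
    "cpu=0.93", host="web01", source="collectd",
    sourcetype="metrics", index="main", tenant="ABC",
)

log_destination = route(log_record, rules)
metric_destination = route(metric_record, rules)
```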

As will be described in greater detail herein, at least with reference to FIG. 4, the indexing system 212 can process the data and store it, for example, in common storage 216. As part of processing the data, the indexing system can identify timestamps associated with the data, organize the data into buckets or time series buckets, convert editable buckets to non-editable buckets, store copies of the buckets in common storage 216, merge buckets, generate indexes of the data, etc. In addition, the indexing system 212 can update the data store catalog 220 with information related to the buckets (pre-merged or merged) or data that is stored in common storage 216, and can communicate with the intake system 210 about the status of the data storage.

As will be described in greater detail herein, at least with reference to FIG. 5, the query system 214 can receive queries that identify a set of data to be processed and a manner of processing the set of data from one or more client devices 204, process the queries to identify the set of data, and execute the query on the set of data. In some cases, as part of executing the query, the query system 214 can use the data store catalog 220 to identify the set of data to be processed or its location in common storage 216 and/or can retrieve data from common storage 216 or the query acceleration data store 222. In addition, in some embodiments, the query system 214 can store some or all of the query results in the query acceleration data store 222.

As mentioned and as will be described in greater detail below, the common storage 216 can be made up of one or more data stores 218 storing data that has been processed by the indexing system 212. The common storage 216 can be configured to provide high availability, highly resilient, low loss data storage. In some cases, to provide the high availability, highly resilient, low loss data storage, the common storage 216 can store multiple copies of the data in the same and different geographic locations and across different types of data stores (e.g., solid state, hard drive, tape, etc.). Further, as data is received at the common storage 216 it can be automatically replicated multiple times according to a replication factor to different data stores across the same and/or different geographic locations. In some embodiments, the common storage 216 can correspond to cloud storage, such as Amazon Simple Storage Service (S3) or Elastic Block Storage (EBS), Google Cloud Storage, Microsoft Azure Storage, etc.

In some embodiments, the indexing system 212 can read from and write to the common storage 216. For example, the indexing system 212 can copy buckets of data from its local or shared data stores to the common storage 216. In certain embodiments, the query system 214 can read from, but cannot write to, the common storage 216. For example, the query system 214 can read the buckets of data stored in common storage 216 by the indexing system 212, but may not be able to copy buckets or other data to the common storage 216. In some embodiments, the intake system 210 does not have access to the common storage 216. However, in some embodiments, one or more components of the intake system 210 can write data to the common storage 216 that can be read by the indexing system 212.

As described herein, such as with reference to FIGS. 5B and 5C, in some embodiments, data in the data intake and query system 108 (e.g., in the data stores of the indexers of the indexing system 212, common storage 216, or search nodes of the query system 214) can be stored in one or more time series buckets. Each bucket can include raw machine data associated with a time stamp and additional information about the data or bucket, such as, but not limited to, one or more filters, indexes (e.g., TSIDX, inverted indexes, keyword indexes, etc.), bucket summaries, etc. In some embodiments, the bucket data and information about the bucket data is stored in one or more files. For example, the raw machine data, filters, indexes, bucket summaries, etc. can be stored in respective files in or associated with a bucket. In certain cases, the group of files can be associated together to form the bucket.

The data store catalog 220 can store information about the data stored in common storage 216, such as, but not limited to an identifier for a set of data or buckets, a location of the set of data, tenants or indexes associated with the set of data, timing information about the data, etc. For example, in embodiments where the data in common storage 216 is stored as buckets, the data store catalog 220 can include a bucket identifier for the buckets in common storage 216, a location of or path to the bucket in common storage 216, a time range of the data in the bucket (e.g., range of time between the first-in-time event of the bucket and the last-in-time event of the bucket), a tenant identifier identifying a customer or computing device associated with the bucket, and/or an index (also referred to herein as a partition) associated with the bucket, etc. In certain embodiments, the data intake and query system 108 includes multiple data store catalogs 220. For example, in some embodiments, the data intake and query system 108 can include a data store catalog 220 for each tenant (or group of tenants), each partition of each tenant (or group of indexes), etc. In some cases, the data intake and query system 108 can include a single data store catalog 220 that includes information about buckets associated with multiple or all of the tenants associated with the data intake and query system 108.
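For illustration only, a catalog holding the per-bucket information listed above might be modeled as in the following sketch. The class names, attribute names, and storage paths are assumptions; the sketch shows how a bucket identifier, location, time range, tenant identifier, and index could be used to narrow a search to the buckets whose time range overlaps a query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketEntry:
    bucket_id: str
    path: str        # location of the bucket in common storage (hypothetical)
    earliest: int    # time of the first-in-time event in the bucket
    latest: int      # time of the last-in-time event in the bucket
    tenant_id: str
    index: str       # also referred to as a partition


class DataStoreCatalog:
    def __init__(self):
        self._entries = {}

    def upsert(self, entry):
        """Add or update the entry for a bucket."""
        self._entries[entry.bucket_id] = entry

    def remove(self, bucket_id):
        """Drop an entry, e.g., after buckets are merged or deleted."""
        self._entries.pop(bucket_id, None)

    def find(self, tenant_id, index, start, end):
        """Return identifiers of buckets whose time range overlaps [start, end]."""
        return sorted(
            e.bucket_id
            for e in self._entries.values()
            if e.tenant_id == tenant_id
            and e.index == index
            and e.earliest <= end
            and e.latest >= start
        )


catalog = DataStoreCatalog()
catalog.upsert(BucketEntry("b1", "store/abc/main/b1", 0, 99, "ABC", "main"))
catalog.upsert(BucketEntry("b2", "store/abc/main/b2", 100, 199, "ABC", "main"))
catalog.upsert(BucketEntry("b3", "store/xyz/main/b3", 100, 199, "XYZ", "main"))

overlapping = catalog.find("ABC", "main", start=50, end=120)
```

Filtering on the tenant identifier in `find` keeps one tenant's query from matching another tenant's buckets even when a single catalog covers several tenants.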

The indexing system 212 can update the data store catalog 220 as the indexing system 212 stores data in common storage 216. Furthermore, the indexing system 212 or other computing device associated with the data store catalog 220 can update the data store catalog 220 as the information in the common storage 216 changes (e.g., as buckets in common storage 216 are merged, deleted, etc.). In addition, as described herein, the query system 214 can use the data store catalog 220 to identify data to be searched or data that satisfies at least a portion of a query. In some embodiments, the query system 214 makes requests to and receives data from the data store catalog 220 using an application programming interface (“API”).

The query acceleration data store 222 can store the results or partial results of queries, or otherwise be used to accelerate queries. For example, if a user submits a query that has no end date, the query system 214 can store an initial set of results in the query acceleration data store 222. As additional query results are determined based on additional data, the additional results can be combined with the initial set of results, and so on. In this way, the query system 214 can avoid re-searching all of the data that may be responsive to the query and instead search the data that has not already been searched.
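A minimal sketch of this incremental behavior follows, assuming each event carries a numeric `time` field and that a simple high-water mark is enough to tell searched data from unsearched data. The class, its method, and the event layout are hypothetical.

```python
class QueryAccelerationStore:
    """Caches per-query results and the time up to which data was searched."""

    def __init__(self):
        self._cache = {}  # query_id -> (results, high-water mark)

    def run(self, query_id, predicate, events):
        results, watermark = self._cache.get(query_id, ([], float("-inf")))
        # Search only data that has not already been searched for this query.
        unsearched = [e for e in events if e["time"] > watermark]
        # Combine the additional results with the previously stored results.
        results = results + [e for e in unsearched if predicate(e)]
        if unsearched:
            watermark = max(e["time"] for e in unsearched)
        self._cache[query_id] = (results, watermark)
        return results, len(unsearched)


def is_error(event):
    return event["level"] == "ERROR"


store = QueryAccelerationStore()
events = [
    {"time": 1, "level": "ERROR"},
    {"time": 2, "level": "INFO"},
    {"time": 3, "level": "ERROR"},
]
first_results, first_searched = store.run("errors", is_error, events)

# Additional data arrives; the open-ended query is evaluated again.
events.append({"time": 4, "level": "ERROR"})
second_results, second_searched = store.run("errors", is_error, events)
```

On the second run only the newly arrived event is examined, while the returned results still include the matches found on the first run.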

In some environments, a user of a data intake and query system 108 may install and configure, on computing devices owned and operated by the user, one or more software applications that implement some or all of these system components. For example, a user may install a software application on server computers owned by the user and configure each server to operate as one or more of intake system 210, indexing system 212, query system 214, common storage 216, data store catalog 220, or query acceleration data store 222, etc. This arrangement generally may be referred to as an “on-premises” solution. That is, the system 108 is installed and operates on computing devices directly controlled by the user of the system. Some users may prefer an on-premises solution because it may provide a greater level of control over the configuration of certain aspects of the system (e.g., security, privacy, standards, controls, etc.). However, other users may instead prefer an arrangement in which the user is not directly responsible for providing and managing the computing devices upon which various components of system 108 operate.

In certain embodiments, one or more of the components of a data intake and query system 108 can be implemented in a remote distributed computing system. In this context, a remote distributed computing system or cloud-based service can refer to a service hosted by one or more computing resources that are accessible to end users over a network, for example, by using a web browser or other application on a client device to interface with the remote computing resources. For example, a service provider may provide a data intake and query system 108 by managing computing resources configured to implement various aspects of the system (e.g., intake system 210, indexing system 212, query system 214, common storage 216, data store catalog 220, or query acceleration data store 222, etc.) and by providing access to the system to end users via a network. Typically, a user may pay a subscription or other fee to use such a service. Each subscribing user of the cloud-based service may be provided with an account that enables the user to configure a customized cloud-based system based on the user's preferences. When implemented as a cloud-based service, various components of the system 108 can be implemented using containerization or operating-system-level virtualization, or other virtualization technique. For example, one or more components of the intake system 210, indexing system 212, or query system 214 can be implemented as separate software containers or container instances. Each container instance can have certain resources (e.g., memory, processor, etc.) of the underlying host computing system assigned to it, but may share the same operating system and may use the operating system's system call interface. Each container may provide an isolated execution environment on the host system, such as by providing a memory space of the host system that is logically isolated from memory space of other containers. Further, each container may run the same or different computer applications concurrently or separately, and may interact with each other. Although reference is made herein to containerization and container instances, it will be understood that other virtualization techniques can be used. For example, the components can be implemented using virtual machines using full virtualization or paravirtualization, etc. Thus, where reference is made to “containerized” components, it should be understood that such components may additionally or alternatively be implemented in other isolated execution environments, such as a virtual machine environment.

3.1. Intake System

As detailed below, data may be ingested at the data intake and query system 108 through an intake system 210 configured to conduct preliminary processing on the data, and make the data available to downstream systems or components, such as the indexing system 212, query system 214, third party systems, etc.

One example configuration of an intake system 210 is shown in FIG. 3A. As shown in FIG. 3A, the intake system 210 includes a forwarder 302, a data retrieval subsystem 304, an intake ingestion buffer 306, a streaming data processor 308, and an output ingestion buffer 310. As described in detail below, the components of the intake system 210 may be configured to process data according to a streaming data model, such that data ingested into the data intake and query system 108 is processed rapidly (e.g., within seconds or minutes of initial reception at the intake system 210) and made available to downstream systems or components. The initial processing of the intake system 210 may include search or analysis of the data ingested into the intake system 210. For example, the initial processing can transform data ingested into the intake system 210 sufficiently, for example, for the data to be searched by a query system 214, thus enabling “real-time” searching for data on the data intake and query system 108 (e.g., without requiring indexing of the data). Various additional and alternative uses for data processed by the intake system 210 are described below.

Although shown as separate components, the forwarder 302, data retrieval subsystem 304, intake ingestion buffer 306, streaming data processors 308, and output ingestion buffer 310, in various embodiments, may reside on the same machine or be distributed across multiple machines in any combination. In one embodiment, any or all of the components of the intake system can be implemented using one or more computing devices as distinct computing devices or as one or more container instances or virtual machines across one or more computing devices. It will be appreciated by those skilled in the art that the intake system 210 may have more or fewer components than are illustrated in FIGS. 3A and 3B. In addition, the intake system 210 could include various web services and/or peer-to-peer network configurations or inter-container communication network provided by an associated container instantiation or orchestration platform. Thus, the intake system 210 of FIGS. 3A and 3B should be taken as illustrative. For example, in some embodiments, components of the intake system 210, such as the ingestion buffers 306 and 310 and/or the streaming data processors 308, may be executed by one or more virtual machines implemented in a hosted computing environment. A hosted computing environment may include one or more rapidly provisioned and released computing resources, which computing resources may include computing, networking and/or storage devices. A hosted computing environment may also be referred to as a cloud computing environment. Accordingly, the hosted computing environment can include any proprietary or open source extensible computing technology, such as Apache Flink or Apache Spark, to enable fast or on-demand horizontal compute capacity scaling of the streaming data processor 308.

In some embodiments, some or all of the elements of the intake system 210 (e.g., forwarder 302, data retrieval subsystem 304, intake ingestion buffer 306, streaming data processors 308, and output ingestion buffer 310, etc.) may reside on one or more computing devices, such as servers, which may be communicatively coupled with each other and with the data sources 202, query system 214, indexing system 212, or other components. In other embodiments, some or all of the elements of the intake system 210 may be implemented as worker nodes as disclosed in U.S. patent application Ser. Nos. 15/665,159, 15/665,148, 15/665,187, 15/665,248, 15/665,197, 15/665,279, 15/665,302, and 15/665,339, each of which is incorporated by reference herein in its entirety (hereinafter referred to as “the Parent Applications”).

As noted above, the intake system 210 can function to conduct preliminary processing of data ingested at the data intake and query system 108. As such, the intake system 210 illustratively includes a forwarder 302 that obtains data from a data source 202 and transmits the data to a data retrieval subsystem 304. The data retrieval subsystem 304 may be configured to convert or otherwise format data provided by the forwarder 302 into an appropriate format for inclusion at the intake ingestion buffer and transmit the message to the intake ingestion buffer 306 for processing. Thereafter, a streaming data processor 308 may obtain data from the intake ingestion buffer 306, process the data according to one or more rules, and republish the data to either the intake ingestion buffer 306 (e.g., for additional processing) or to the output ingestion buffer 310, such that the data is made available to downstream components or systems. In this manner, the intake system 210 may repeatedly or iteratively process data according to any of a variety of rules, such that the data is formatted for use on the data intake and query system 108 or any other system. As discussed below, the intake system 210 may be configured to conduct such processing rapidly (e.g., in “real-time” with little or no perceptible delay), while ensuring resiliency of the data.

3.1.1. Forwarder

The forwarder 302 can include or be executed on a computing device configured to obtain data from a data source 202 and transmit the data to the data retrieval subsystem 304. In some implementations the forwarder 302 can be installed on a computing device associated with the data source 202. While a single forwarder 302 is illustratively shown in FIG. 3A, the intake system 210 may include a number of different forwarders 302. Each forwarder 302 may illustratively be associated with a different data source 202. A forwarder 302 initially may receive the data as a raw data stream generated by the data source 202. For example, a forwarder 302 may receive a data stream from a log file generated by an application server, from a stream of network data from a network device, or from any other source of data. In some embodiments, a forwarder 302 receives the raw data and may segment the data stream into “blocks”, possibly of a uniform data size, to facilitate subsequent processing steps. The forwarder 302 may additionally or alternatively modify data received, prior to forwarding the data to the data retrieval subsystem 304. Illustratively, the forwarder 302 may “tag” metadata for each data block, such as by specifying a source, source type, or host associated with the data, or by appending one or more timestamps or time ranges to each data block.
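The segmentation and tagging performed by a forwarder can be sketched as follows. This is a minimal illustration only, not the forwarder's actual implementation: the `Block` type, the `segment_stream` function, and the fixed `block_size` parameter are hypothetical names introduced for the example, and the timestamp is assumed to be the time at which the block was cut.

```python
import time
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Block:
    """A hypothetical data block with the metadata tags named in the text."""
    data: bytes
    source: str
    source_type: str
    host: str
    timestamp: float


def segment_stream(raw: bytes, block_size: int, *, source: str,
                   source_type: str, host: str) -> Iterator[Block]:
    """Split a raw byte stream into blocks of at most block_size bytes,
    tagging each block with source, source type, host, and a timestamp."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(raw), block_size):
        yield Block(
            data=raw[offset:offset + block_size],
            source=source,
            source_type=source_type,
            host=host,
            timestamp=time.time(),  # assumption: tag with the time of segmentation
        )


blocks = list(segment_stream(b"abcdefghij", 4, source="app.log",
                             source_type="log", host="web-1"))
```

Here every block except possibly the last has a uniform size; a real forwarder might instead cut on line or event boundaries.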

In some embodiments, a forwarder 302 may comprise a service accessible to data sources 202 via a network 206. For example, one type of forwarder 302 may be capable of consuming vast amounts of real-time data from a potentially large number of data sources 202. The forwarder 302 may, for example, comprise a computing device which implements multiple data pipelines or “queues” to handle forwarding of network data to data retrieval subsystems 304.

3.1.2. Data Retrieval Subsystem

The data retrieval subsystem 304 illustratively corresponds to a computing device which obtains data (e.g., from the forwarder 302), and transforms the data into a format suitable for publication on the intake ingestion buffer 306. Illustratively, where the forwarder 302 segments input data into discrete blocks, the data retrieval subsystem 304 may generate a message for each block, and publish the message to the intake ingestion buffer 306. Generation of a message for each block may include, for example, formatting the data of the message in accordance with the requirements of a streaming data system implementing the intake ingestion buffer 306, the requirements of which may vary according to the streaming data system. In one embodiment, the intake ingestion buffer 306 formats messages according to the protocol buffers method of serializing structured data. Thus, the intake ingestion buffer 306 may be configured to convert data from an input format into a protocol buffer format. Where a forwarder 302 does not segment input data into discrete blocks, the data retrieval subsystem 304 may itself segment the data. Similarly, the data retrieval subsystem 304 may append metadata to the input data, such as a source, source type, or host associated with the data.

Generation of the message may include “tagging” the message with various information, which may be included as metadata for the data provided by the forwarder 302, and determining a “topic” for the message, under which the message should be published to the intake ingestion buffer 306. In general, the “topic” of a message may reflect a categorization of the message on a streaming data system. Illustratively, each topic may be associated with a logically distinct queue of messages, such that a downstream device or system may “subscribe” to the topic in order to be provided with messages published to the topic on the streaming data system.

In one embodiment, the data retrieval subsystem 304 may obtain a set of topic rules (e.g., provided by a user of the data intake and query system 108 or based on automatic inspection or identification of the various upstream and downstream components of the data intake and query system 108) that determine a topic for a message as a function of the received data or metadata regarding the received data. For example, the topic of a message may be determined as a function of the data source 202 from which the data stems. After generation of a message based on input data, the data retrieval subsystem can publish the message to the intake ingestion buffer 306 under the determined topic.
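As a rough sketch of how topic rules might map metadata to a topic, consider the following. The `TopicRule` type, the `determine_topic` function, the first-match-wins ordering, and the `"unclassified"` default topic are all assumptions made for illustration; the text does not prescribe how rules are represented or ordered.

```python
from typing import Callable, Dict, List, NamedTuple


class TopicRule(NamedTuple):
    """Hypothetical topic rule: a predicate over message metadata and a topic."""
    matches: Callable[[Dict[str, str]], bool]
    topic: str


def determine_topic(metadata: Dict[str, str], rules: List[TopicRule],
                    default: str = "unclassified") -> str:
    """Return the topic of the first rule whose predicate accepts the metadata.

    Assumption: rules are evaluated in order and the first match wins."""
    for rule in rules:
        if rule.matches(metadata):
            return rule.topic
    return default


# Example: derive a source-oriented topic from the data source of a message.
rules = [
    TopicRule(lambda m: m.get("source", "").endswith(".log"), "source.logfiles"),
    TopicRule(lambda m: m.get("source_type") == "netflow", "source.network"),
]
topic = determine_topic({"source": "app.log", "host": "web-1"}, rules)
```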

While the data retrieval subsystem 304 is depicted in FIG. 3A as obtaining data from the forwarder 302, the data retrieval subsystem 304 may additionally or alternatively obtain data from other sources. In some instances, the data retrieval subsystem 304 may be implemented as a plurality of intake points, each functioning to obtain data from one or more corresponding data sources (e.g., the forwarder 302, data sources 202, or any other data source), generate messages corresponding to the data, determine topics to which the messages should be published, and publish the messages to one or more topics of the intake ingestion buffer 306.

One illustrative set of intake points implementing the data retrieval subsystem 304 is shown in FIG. 3B. Specifically, as shown in FIG. 3B, the data retrieval subsystem 304 of FIG. 3A may be implemented as a set of push-based publishers 320 or a set of pull-based publishers 330. The illustrative push-based publishers 320 operate on a “push” model, such that messages are generated at the push-based publishers 320 and transmitted to an intake ingestion buffer 306 (shown in FIG. 3B as primary and secondary intake ingestion buffers 306A and 306B, which are discussed in more detail below). As will be appreciated by one skilled in the art, “push” data transmission models generally correspond to models in which a data source determines when data should be transmitted to a data target. A variety of mechanisms exist to provide “push” functionality, including “true push” mechanisms (e.g., where a data source independently initiates transmission of information) and “emulated push” mechanisms, such as “long polling” (a mechanism whereby a data target initiates a connection with a data source, but allows the data source to determine within a timeframe when data is to be transmitted to the data target).

As shown in FIG. 3B, the push-based publishers 320 illustratively include an HTTP intake point 322 and a data intake and query system (DIQS) intake point 324. The HTTP intake point 322 can include a computing device configured to obtain HTTP-based data (e.g., as JavaScript Object Notation, or JSON, messages), to format the HTTP-based data as a message, to determine a topic for the message (e.g., based on fields within the HTTP-based data), and to publish the message to the primary intake ingestion buffer 306A. Similarly, the DIQS intake point 324 can be configured to obtain data from a forwarder 302, to format the forwarder data as a message, to determine a topic for the message, and to publish the message to the primary intake ingestion buffer 306A. In this manner, the DIQS intake point 324 can function in a similar manner to the operations described with respect to the data retrieval subsystem 304 of FIG. 3A.

In addition to the push-based publishers 320, one or more pull-based publishers 330 may be used to implement the data retrieval subsystem 304. The pull-based publishers 330 may function on a “pull” model, whereby a data target (e.g., the primary intake ingestion buffer 306A) functions to continuously or periodically (e.g., each n seconds) query the pull-based publishers 330 for new messages to be placed on the primary intake ingestion buffer 306A. In some instances, development of pull-based systems may require less coordination of functionality between a pull-based publisher 330 and the primary intake ingestion buffer 306A. Thus, for example, pull-based publishers 330 may be more readily developed by third parties (e.g., other than a developer of the data intake and query system 108), and enable the data intake and query system 108 to ingest data associated with third party data sources 202. Accordingly, FIG. 3B includes a set of custom intake points 332A through 332N, each of which functions to obtain data from a third-party data source 202, format the data as a message for inclusion in the primary intake ingestion buffer 306A, determine a topic for the message, and make the message available to the primary intake ingestion buffer 306A in response to a request (a “pull”) for such messages.
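The pull model described above can be illustrated with a small in-memory sketch. The `CustomIntakePoint` class, its `collect` and `pull` methods, and the `poll_once` function are hypothetical names chosen for this example; a deployed intake point would typically expose its pull interface over a network API rather than as a local method call.

```python
from collections import deque
from typing import Deque, Dict, List


class CustomIntakePoint:
    """Hypothetical pull-based publisher: it holds formatted messages
    until the data target asks for them."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self._pending: Deque[Dict[str, str]] = deque()

    def collect(self, record: str) -> None:
        # Format the third-party record as a message and assign its topic.
        self._pending.append({"topic": self.topic, "body": record})

    def pull(self, max_messages: int = 100) -> List[Dict[str, str]]:
        # Hand over at most max_messages messages in response to a "pull".
        batch = []
        while self._pending and len(batch) < max_messages:
            batch.append(self._pending.popleft())
        return batch


def poll_once(intake_points: List[CustomIntakePoint],
              buffer: List[Dict[str, str]]) -> int:
    """One polling pass by the data target; returns the number of messages
    placed on the buffer. A scheduler would call this every n seconds."""
    placed = 0
    for point in intake_points:
        batch = point.pull()
        buffer.extend(batch)
        placed += len(batch)
    return placed


point = CustomIntakePoint(topic="custom.sensor")
point.collect("temp=21.5")
point.collect("temp=21.7")
primary_buffer: List[Dict[str, str]] = []
placed = poll_once([point], primary_buffer)
```

The publisher never initiates a transfer; the data target decides when to ask, which is what distinguishes this sketch from a push-based publisher.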

While the pull-based publishers 330 are illustratively described as developed by third parties, push-based publishers 320 may also in some instances be developed by third parties. Additionally or alternatively, pull-based publishers may be developed by the developer of the data intake and query system 108. To facilitate integration of systems potentially developed by disparate entities, the primary intake ingestion buffer 306A may provide an API through which an intake point may publish messages to the primary intake ingestion buffer 306A. Illustratively, the API may enable an intake point to “push” messages to the primary intake ingestion buffer 306A, or request that the primary intake ingestion buffer 306A “pull” messages from the intake point. Similarly, the streaming data processors 308 may provide an API through which ingestion buffers may register with the streaming data processors 308 to facilitate pre-processing of messages on the ingestion buffers, and the output ingestion buffer 310 may provide an API through which the streaming data processors 308 may publish messages or through which downstream devices or systems may subscribe to topics on the output ingestion buffer 310. Furthermore, any one or more of the intake points 322 through 332N may provide an API through which data sources 202 may submit data to the intake points. Thus, any one or more of the components of FIGS. 3A and 3B may be made available via APIs to enable integration of systems potentially provided by disparate parties.

The specific configuration of publishers 320 and 330 shown in FIG. 3B is intended to be illustrative in nature. For example, the specific number and configuration of intake points may vary according to embodiments of the present application. In some instances, one or more components of the intake system 210 may be omitted. For example, a data source 202 may in some embodiments publish messages to an intake ingestion buffer 306, and thus an intake point 332 may be unnecessary. Other configurations of the intake system 210 are possible.

3.1.3. Ingestion Buffer

The intake system 210 is illustratively configured to ensure message resiliency, such that data is persisted in the event of failures within the intake system 210. Specifically, the intake system 210 may utilize one or more ingestion buffers, which operate to resiliently maintain data received at the intake system 210 until the data is acknowledged by downstream systems or components. In one embodiment, resiliency is provided at the intake system 210 by use of ingestion buffers that operate according to a publish-subscribe (“pub-sub”) message model. In accordance with the pub-sub model, data ingested into the data intake and query system 108 may be atomized as “messages,” each of which is categorized into one or more “topics.” An ingestion buffer can maintain a queue for each such topic, and enable devices to “subscribe” to a given topic. As messages are published to the topic, the ingestion buffer can function to transmit the messages to each subscriber, and ensure message resiliency until at least each subscriber has acknowledged receipt of the message (e.g., at which point the ingestion buffer may delete the message). In this manner, the ingestion buffer may function as a “broker” within the pub-sub model. A variety of techniques to ensure resiliency at a pub-sub broker are known in the art, and thus will not be described in detail herein. In one embodiment, an ingestion buffer is implemented by a streaming data source. As noted above, examples of streaming data sources include (but are not limited to) Amazon's Simple Queue Service (“SQS”) or Kinesis™ services, devices executing Apache Kafka™ software, or devices implementing the Message Queue Telemetry Transport (MQTT) protocol. Any one or more of these example streaming data sources may be utilized to implement an ingestion buffer in accordance with embodiments of the present disclosure.
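The retain-until-acknowledged behavior of a pub-sub broker can be sketched with a toy in-memory class. This is illustrative only: the `IngestionBuffer` class and its method names are hypothetical, the sketch keeps state in process memory (so it provides none of the durability a real broker such as Kafka or SQS provides), and it assumes that only subscribers present at publication time must acknowledge a message.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


class IngestionBuffer:
    """Toy pub-sub broker: a message is kept until every subscriber that
    was subscribed to its topic at publication time has acknowledged it."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[str]] = defaultdict(set)
        self._messages: Dict[int, Tuple[str, Any]] = {}  # id -> (topic, payload)
        self._pending: Dict[int, Set[str]] = {}          # id -> subscribers yet to ack
        self._next_id = 0

    def subscribe(self, topic: str, subscriber: str) -> None:
        self._subscribers[topic].add(subscriber)

    def publish(self, topic: str, payload: Any) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = (topic, payload)
        self._pending[msg_id] = set(self._subscribers[topic])
        return msg_id

    def fetch(self, subscriber: str) -> List[Tuple[int, Any]]:
        # Messages this subscriber has not yet acknowledged.
        return [(mid, self._messages[mid][1])
                for mid, waiting in self._pending.items()
                if subscriber in waiting]

    def acknowledge(self, subscriber: str, msg_id: int) -> None:
        waiting = self._pending.get(msg_id)
        if waiting is None:
            return
        waiting.discard(subscriber)
        if not waiting:  # every subscriber has acknowledged: safe to delete
            del self._pending[msg_id]
            del self._messages[msg_id]

    def retained(self) -> int:
        return len(self._messages)


buf = IngestionBuffer()
buf.subscribe("index", "indexing-system")
buf.subscribe("index", "query-system")
mid = buf.publish("index", "event-1")
buf.acknowledge("indexing-system", mid)
still_retained = buf.retained()   # the query system has not acknowledged yet
buf.acknowledge("query-system", mid)
```

A message that loses its acknowledgment (for example, because a subscriber crashed mid-processing) simply remains fetchable, which is the property that lets downstream components recover without data loss.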

With reference to FIG. 3A, the intake system 210 may include at least two logical ingestion buffers: an intake ingestion buffer 306 and an output ingestion buffer 310. As noted above, the intake ingestion buffer 306 can be configured to receive messages from the data retrieval subsystem 304 and resiliently store the message. The intake ingestion buffer 306 can further be configured to transmit the message to the streaming data processors 308 for processing. As further described below, the streaming data processors 308 can be configured with one or more data transformation rules to transform the messages, and republish the messages to one or both of the intake ingestion buffer 306 and the output ingestion buffer 310. The output ingestion buffer 310, in turn, may make the messages available to various subscribers to the output ingestion buffer 310, which subscribers may include the query system 214, the indexing system 212, or other third-party devices (e.g., client devices 102, host devices 106, etc.).

Both the intake ingestion buffer 306 and output ingestion buffer 310 may be implemented on a streaming data source, as noted above. In one embodiment, the intake ingestion buffer 306 operates to maintain source-oriented topics, such as topics for each data source 202 from which data is obtained, while the output ingestion buffer 310 operates to maintain content-oriented topics, such as topics to which the data of an individual message pertains. As discussed in more detail below, the streaming data processors 308 can be configured to transform messages from the intake ingestion buffer 306 (e.g., arranged according to source-oriented topics) and publish the transformed messages to the output ingestion buffer 310 (e.g., arranged according to content-oriented topics). In some instances, the streaming data processors 308 may additionally or alternatively republish transformed messages to the intake ingestion buffer 306, enabling iterative or repeated processing of the data within the message by the streaming data processors 308.

While shown in FIG. 3A as distinct, these ingestion buffers 306 and 310 may be implemented as a common ingestion buffer. However, use of distinct ingestion buffers may be beneficial, for example, where a geographic region in which data is received differs from a region in which the data is desired. For example, use of distinct ingestion buffers may beneficially allow the intake ingestion buffer 306 to operate in a first geographic region associated with a first set of data privacy restrictions, while the output ingestion buffer 310 operates in a second geographic region associated with a second set of data privacy restrictions. In this manner, the intake system 210 can be configured to comply with all relevant data privacy restrictions, ensuring privacy of data processed at the data intake and query system 108.

Moreover, either or both of the ingestion buffers 306 and 310 may be implemented across multiple distinct devices, as either a single or multiple ingestion buffers. Illustratively, as shown in FIG. 3B, the intake system 210 may include both a primary intake ingestion buffer 306A and a secondary intake ingestion buffer 306B. The primary intake ingestion buffer 306A is illustratively configured to obtain messages from the data retrieval subsystem 304 (e.g., implemented as a set of intake points 322 through 332N). The secondary intake ingestion buffer 306B is illustratively configured to provide an additional set of messages (e.g., from other data sources 202). In one embodiment, the primary intake ingestion buffer 306A is provided by an administrator or developer of the data intake and query system 108, while the secondary intake ingestion buffer 306B is a user-supplied ingestion buffer (e.g., implemented externally to the data intake and query system 108).

As noted above, an intake ingestion buffer 306 may in some embodiments categorize messages according to source-oriented topics (e.g., denoting a data source 202 from which the message was obtained). In other embodiments, an intake ingestion buffer 306 may categorize messages according to intake-oriented topics (e.g., denoting the intake point from which the message was obtained). The number and variety of such topics may vary, and thus are not shown in FIG. 3B. In one embodiment, the intake ingestion buffer 306 maintains only a single topic (e.g., all data to be ingested at the data intake and query system 108).

The output ingestion buffer 310 may in one embodiment categorize messages according to content-centric topics (e.g., determined based on the content of a message). Additionally or alternatively, the output ingestion buffer 310 may categorize messages according to consumer-centric topics (e.g., topics intended to store messages for consumption by a downstream device or system). An illustrative number of topics are shown in FIG. 3B, as topics 342 through 352N. Each topic may correspond to a queue of messages (e.g., in accordance with the pub-sub model) relevant to the corresponding topic. As described in more detail below, the streaming data processors 308 may be configured to process messages from the intake ingestion buffer 306 and determine into which of the topics 342 through 352N to place the messages. For example, the index topic 342 may be intended to store messages holding data that should be consumed and indexed by the indexing system 212. The notable event topic 344 may be intended to store messages holding data that indicates a notable event at a data source 202 (e.g., the occurrence of an error or other notable event). The metrics topic 346 may be intended to store messages holding metrics data for data sources 202. The search results topic 348 may be intended to store messages holding data responsive to a search query. The mobile alerts topic 350 may be intended to store messages holding data for which an end user has requested alerts on a mobile device. A variety of custom topics 352A through 352N may be intended to hold data relevant to end-user-created topics.

As will be described below, by application of message transformation rules at the streaming data processors 308, the intake system 210 may divide and categorize messages from the intake ingestion buffer 306, partitioning the message into output topics relevant to a specific downstream consumer. In this manner, specific portions of data input to the data intake and query system 108 may be “divided out” and handled separately, enabling different types of data to be handled differently, and potentially at different speeds. Illustratively, the index topic 342 may be configured to include all or substantially all data included in the intake ingestion buffer 306. Given the volume of data, there may be a significant delay (e.g., minutes or hours) before a downstream consumer (e.g., the indexing system 212) processes a message in the index topic 342. Thus, for example, searching data processed by the indexing system 212 may incur significant delay.

Conversely, the search results topic 348 may be configured to hold only messages corresponding to data relevant to a current query. Illustratively, on receiving a query from a client device 204, the query system 214 may transmit to the intake system 210 a rule that detects, within messages from the intake ingestion buffer 306A, data potentially relevant to the query. The streaming data processors 308 may republish these messages within the search results topic 348, and the query system 214 may subscribe to the search results topic 348 in order to obtain the data within the messages. In this manner, the query system 214 can “bypass” the indexing system 212 and avoid delay that may be caused by that system, thus enabling faster (and potentially real time) display of search results.

While shown in FIGS. 3A and 3B as a single output ingestion buffer 310, the intake system 210 may in some instances utilize multiple output ingestion buffers 310.

3.1.4. Streaming Data Processors

As noted above, the streaming data processors 308 may apply one or more rules to process messages from the intake ingestion buffer 306A into messages on the output ingestion buffer 310. These rules may be specified, for example, by an end user of the data intake and query system 108 or may be automatically generated by the data intake and query system 108 (e.g., in response to a user query).

Illustratively, each rule may correspond to a set of selection criteria indicating messages to which the rule applies, as well as one or more processing sub-rules indicating an action to be taken by the streaming data processors 308 with respect to the message. The selection criteria may include any number or combination of criteria based on the data included within a message or metadata of the message (e.g., a topic to which the message is published). In one embodiment, the selection criteria are formatted in the same manner or similarly to extraction rules, discussed in more detail below. For example, selection criteria may include regular expressions that derive one or more values or a sub-portion of text from the portion of machine data in each message to produce a value for the field for that message. When a message is located within the intake ingestion buffer 306 that matches the selection criteria, the streaming data processors 308 may apply the processing sub-rules to the message. Processing sub-rules may indicate, for example, a topic of the output ingestion buffer 310 into which the message should be placed. Processing sub-rules may further indicate transformations, such as field or unit normalization operations, to be performed on the message. Illustratively, a transformation may include modifying data within the message, such as altering a format in which the data is conveyed (e.g., converting millisecond timestamp values to microsecond timestamp values, converting imperial units to metric units, etc.), or supplementing the data with additional information (e.g., appending an error descriptor to an error code). In some instances, the streaming data processors 308 may be in communication with one or more external data stores (the locations of which may be specified within a rule) that provide information used to supplement or enrich messages processed at the streaming data processors 308. For example, a specific rule may include selection criteria identifying an error code within a message of the primary intake ingestion buffer 306A, and specifying that, when the error code is detected within a message, the streaming data processors 308 should conduct a lookup in an external data source (e.g., a database) to retrieve the human-readable descriptor for that error code, and inject the descriptor into the message. In this manner, rules may be used to process, transform, or enrich messages.
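The error-code example above can be sketched as a rule made of a regular-expression selection criterion and two processing sub-rules (an enrichment transformation and an output topic). All names here are hypothetical: the `Rule` type, the `apply_rules` and `enrich_error` functions, the `error_code=` message syntax, and the `ERROR_DESCRIPTORS` dictionary, which merely stands in for the external data store a real rule would query.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: a dictionary stands in for the external data store (e.g., a database).
ERROR_DESCRIPTORS = {"E42": "disk quota exceeded"}


@dataclass
class Rule:
    selection: "re.Pattern[str]"                  # selection criteria
    output_topic: str                             # sub-rule: destination topic
    transform: Callable[[Dict[str, str], "re.Match[str]"], Dict[str, str]]  # sub-rule


def enrich_error(message: Dict[str, str], match: "re.Match[str]") -> Dict[str, str]:
    """Look up the human-readable descriptor for the matched error code and
    inject it into a copy of the message."""
    enriched = dict(message)
    enriched["error_descriptor"] = ERROR_DESCRIPTORS.get(
        match.group("code"), "unknown error")
    return enriched


def apply_rules(message: Dict[str, str],
                rules: List[Rule]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (output topic, transformed message) for each rule whose
    selection criteria match the message body."""
    results = []
    for rule in rules:
        match = rule.selection.search(message["body"])
        if match:
            results.append((rule.output_topic, rule.transform(message, match)))
    return results


rules = [Rule(re.compile(r"error_code=(?P<code>E\d+)"), "notable_events", enrich_error)]
routed = apply_rules({"body": "status=fail error_code=E42"}, rules)
```

A message matching no rule yields an empty list in this sketch; what a real system does with unmatched messages (drop, default topic, or leave in place) is not specified here.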

The streaming data processors 308 may include a set of computing devices configured to process messages from the intake ingestion buffer 306 at a speed commensurate with a rate at which messages are placed into the intake ingestion buffer 306. In one embodiment, the number of streaming data processors 308 used to process messages may vary based on a number of messages on the intake ingestion buffer 306 awaiting processing. Thus, as additional messages are queued into the intake ingestion buffer 306, the number of streaming data processors 308 may be increased to ensure that such messages are rapidly processed. In some instances, the streaming data processors 308 may be extensible on a per topic basis. Thus, individual devices implementing the streaming data processors 308 may subscribe to different topics on the intake ingestion buffer 306, and the number of devices subscribed to an individual topic may vary according to a rate of publication of messages to that topic (e.g., as measured by a backlog of messages in the topic). In this way, the intake system 210 can support ingestion of massive amounts of data from numerous data sources 202.
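One simple way to size the processor pool from a backlog is a capped ceiling division, sketched below. The `processors_needed` function, the notion of a fixed per-processor capacity, and the minimum and maximum bounds are assumptions for illustration; the text does not specify a scaling formula.

```python
import math
from typing import Dict


def processors_needed(backlog: int, per_processor_capacity: int,
                      min_processors: int = 1, max_processors: int = 64) -> int:
    """Number of processors to run for a topic, given its message backlog.

    Assumption: each processor can drain per_processor_capacity messages per
    scaling interval, and the pool is clamped to [min_processors, max_processors]."""
    if per_processor_capacity <= 0:
        raise ValueError("per_processor_capacity must be positive")
    wanted = math.ceil(backlog / per_processor_capacity)
    return max(min_processors, min(max_processors, wanted))


# Per-topic scaling: each topic's subscriber count follows its own backlog.
backlogs: Dict[str, int] = {"source.web": 2500, "source.db": 0, "source.iot": 10**9}
plan = {topic: processors_needed(n, per_processor_capacity=1000)
        for topic, n in backlogs.items()}
```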

In some embodiments, an intake system may comprise a service accessible to client devices 102 and host devices 106 via a network 104. For example, one type of forwarder may be capable of consuming vast amounts of real-time data from a potentially large number of client devices 102 and/or host devices 106. The forwarder may, for example, comprise a computing device which implements multiple data pipelines or “queues” to handle forwarding of network data to indexers. A forwarder may also perform many of the functions that are performed by an indexer. For example, a forwarder may perform keyword extractions on raw data or parse raw data to create events. A forwarder may generate time stamps for events. Additionally or alternatively, a forwarder may perform routing of events to indexers. Data store 212 may contain events derived from machine data from a variety of sources all pertaining to the same component in an IT environment, and this data may be produced by the machine in question or by other components in the IT environment.

3.2. Indexing System

FIG. 4 is a block diagram illustrating an embodiment of an indexing system 212 of the data intake and query system 108. The indexing system 212 can receive, process, and store data from multiple data sources 202, which may be associated with different tenants, users, etc. Using the received data, the indexing system can generate events that include a portion of machine data associated with a timestamp and store the events in buckets based on one or more of the timestamps, tenants, indexes, etc., associated with the data. Moreover, the indexing system 212 can include various components that enable it to provide a stateless indexing service, or indexing service that is able to rapidly recover without data loss if one or more components of the indexing system 212 become unresponsive or unavailable.

In the illustrated embodiment, the indexing system 212 includes an indexing system manager 402 and one or more indexing nodes 404. However, it will be understood that the indexing system 212 can include fewer or more components. For example, in some embodiments, the common storage 216 or data store catalog 220 can form part of the indexing system 212, etc.

As described herein, each of the components of the indexing system 212can be implemented using one or more computing devices as distinctcomputing devices or as one or more container instances or virtualmachines across one or more computing devices. For example, in someembodiments, the indexing system manager 402 and indexing nodes 404 canbe implemented as distinct computing devices with separate hardware,memory, and processors. In certain embodiments, the indexing systemmanager 402 and indexing nodes 404 can be implemented on the same oracross different computing devices as distinct container instances, witheach container having access to a subset of the resources of a hostcomputing device (e.g., a subset of the memory or processing time of theprocessors of the host computing device), but sharing a similaroperating system. In some cases, the components can be implemented asdistinct virtual machines across one or more computing devices, whereeach virtual machine can have its own unshared operating system butshares the underlying hardware with other virtual machines on the samehost computing device.

3.2.1. Indexing System Manager

As mentioned, the indexing system manager 402 can monitor and manage the indexing nodes 404, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container. In certain embodiments, the indexing system 212 can include one indexing system manager 402 to manage all indexing nodes 404 of the indexing system 212. In some embodiments, the indexing system 212 can include multiple indexing system managers 402. For example, an indexing system manager 402 can be instantiated for each computing device (or group of computing devices) configured as a host computing device for multiple indexing nodes 404.

The indexing system manager 402 can handle resource management, creation/destruction of indexing nodes 404, high availability, load balancing, application upgrades/rollbacks, logging and monitoring, storage, networking, service discovery, and performance and scalability, and otherwise handle containerization management of the containers of the indexing system 212. In certain embodiments, the indexing system manager 402 can be implemented using Kubernetes or Swarm.

In some cases, the indexing system manager 402 can monitor the available resources of a host computing device and request additional resources in a shared resource environment based on the workload of the indexing nodes 404, or create, destroy, or reassign indexing nodes 404 based on workload. Further, the indexing system manager 402 can assign indexing nodes 404 to handle data streams based on workload, system resources, etc.

3.2.2. Indexing Nodes

The indexing nodes 404 can include one or more components to implement various functions of the indexing system 212. In the illustrated embodiment, the indexing node 404 includes an indexing node manager 406, partition manager 408, indexer 410, data store 412, and bucket manager 414. As described herein, the indexing nodes 404 can be implemented on separate computing devices or as containers or virtual machines in a virtualization environment.

In some embodiments, an indexing node 404 can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container, or using multiple related containers. In certain embodiments, such as in a Kubernetes deployment, each indexing node 404 can be implemented as a separate container or pod. For example, one or more of the components of the indexing node 404 can be implemented as different containers of a single pod. On a containerization platform such as Docker, for instance, the one or more components of the indexing node can be implemented as different Docker containers managed by orchestration platforms such as Kubernetes or Swarm. Accordingly, reference to a containerized indexing node 404 can refer to the indexing node 404 as being a single container or as one or more components of the indexing node 404 being implemented as different, related containers or virtual machines.

3.2.2.1. Indexing Node Manager

The indexing node manager 406 can manage the processing of the variousstreams or partitions of data by the indexing node 404, and can beimplemented as a distinct computing device, virtual machine, container,container of a pod, or a process or thread associated with a container.For example, in certain embodiments, as partitions or data streams areassigned to the indexing node 404, the indexing node manager 406 cangenerate one or more partition manager(s) 408 to manage each partitionor data stream. In some cases, the indexing node manager 406 generates aseparate partition manager 408 for each partition or shard that isprocessed by the indexing node 404. In certain embodiments, thepartition can correspond to a topic of a data stream of the ingestionbuffer 310. Each topic can be configured in a variety of ways. Forexample, in some embodiments, a topic may correspond to data from aparticular data source 202, tenant, index/partition, or sourcetype. Inthis way, in certain embodiments, the indexing system 212 candiscriminate between data from different sources or associated withdifferent tenants, or indexes/partitions. For example, the indexingsystem 212 can assign more indexing nodes 404 to process data from onetopic (associated with one tenant) than another topic (associated withanother tenant), or store the data from one topic more frequently tocommon storage 216 than the data from a different topic, etc.

In some embodiments, the indexing node manager 406 monitors the variousshards of data being processed by the indexing node 404 and the readpointers or location markers for those shards. In some embodiments, theindexing node manager 406 stores the read pointers or location marker inone or more data stores, such as but not limited to, common storage 216,DynamoDB, S3, or another type of storage system, shared storage system,or networked storage system, etc. As the indexing node 404 processes thedata and the markers for the shards are updated by the intake system210, the indexing node manager 406 can be updated to reflect the changesto the read pointers or location markers. In this way, if a particularpartition manager 408 becomes unresponsive or unavailable, the indexingnode manager 406 can generate a new partition manager 408 to handle thedata stream without losing context of what data is to be read from theintake system 210. Accordingly, in some embodiments, by using theingestion buffer 310 and tracking the location of the location markersin the shards of the ingestion buffer, the indexing system 212 can aidin providing a stateless indexing service.

In some embodiments, the indexing node manager 406 is implemented as a background process, or daemon, on the indexing node 404 and the partition manager(s) 408 are implemented as threads, copies, or forks of the background process. In some cases, an indexing node manager 406 can copy itself, or fork, to create a partition manager 408 or cause a template process to copy itself, or fork, to create each new partition manager 408, etc. This may be done for multithreading efficiency or for other reasons related to containerization and efficiency of managing indexers 410. In certain embodiments, the indexing node manager 406 generates a new process for each partition manager 408. In some cases, by generating a new process for each partition manager 408, the indexing node manager 406 can support multiple language implementations and be language agnostic. For example, the indexing node manager 406 can generate a process for a partition manager 408 in Python and create a second process for a partition manager 408 in Golang, etc.

3.2.2.2. Partition Manager

As mentioned, the partition manager(s) 408 can manage the processing of one or more of the partitions or shards of a data stream processed by an indexing node 404 or the indexer 410 of the indexing node 404, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container.

In some cases, managing the processing of a partition or shard can include, but is not limited to, communicating data from a particular shard to the indexer 410 for processing, monitoring the indexer 410 and the size of the data being processed by the indexer 410, instructing the indexer 410 to move the data to common storage 216, and reporting the storage of the data to the intake system 210. For a particular shard or partition of data from the intake system 210, the indexing node manager 406 can assign a particular partition manager 408. The partition manager 408 for that partition can receive the data from the intake system 210 and forward or communicate that data to the indexer 410 for processing.

In some embodiments, the partition manager 408 receives data from a pub-sub messaging system, such as the ingestion buffer 310. As described herein, the ingestion buffer 310 can have one or more streams of data and one or more shards or partitions associated with each stream of data. Each stream of data can be separated into shards and/or other partitions or types of organization of data. In certain cases, each shard can include data from multiple tenants, indexes/partitions, etc. In some cases, each shard can correspond to data associated with a particular tenant, index/partition, source, sourcetype, etc. Accordingly, the indexing system 212 can include a partition manager 408 for individual tenants, indexes/partitions, sources, sourcetypes, etc. In this way, the indexing system 212 can manage and process data differently depending on the tenant, index/partition, source, or sourcetype with which the data is associated. For example, the indexing system 212 can assign more indexing nodes 404 to process data from one tenant than another tenant, or store buckets associated with one tenant or partition/index more frequently to common storage 216 than buckets associated with a different tenant or partition/index, etc.

Accordingly, in some embodiments, a partition manager 408 receives datafrom one or more of the shards or partitions of the ingestion buffer310. The partition manager 408 can forward the data from the shard tothe indexer 410 for processing. In some cases, the amount of data cominginto a shard may exceed the shard's throughput. For example, 4 MB/s ofdata may be sent to an ingestion buffer 310 for a particular shard, butthe ingestion buffer 310 may be able to process only 2 MB/s of data pershard. Accordingly, in some embodiments, the data in the shard caninclude a reference to a location in storage where the indexing system212 can retrieve the data. For example, a reference pointer to data canbe placed in the ingestion buffer 310 rather than putting the dataitself into the ingestion buffer. The reference pointer can reference achunk of data that is larger than the throughput of the ingestion buffer310 for that shard. In this way, the data intake and query system 108can increase the throughput of individual shards of the ingestion buffer310. In such embodiments, the partition manager 408 can obtain thereference pointer from the ingestion buffer 310 and retrieve the datafrom the referenced storage for processing. In some cases, thereferenced storage to which reference pointers in the ingestion buffer310 may point can correspond to the common storage 216 or other cloud orlocal storage. In some implementations, the chunks of data to which thereference pointers refer may be directed to common storage 216 fromintake system 210, e.g., streaming data processor 308 or ingestionbuffer 310.
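As a non-limiting illustration, the use of reference pointers might be sketched as follows. The names `blob_store`, `publish`, and `consume`, the message layout, and the inline size limit are hypothetical and chosen only for this sketch; a plain dictionary stands in for the common storage 216 or other cloud or local storage.

```python
# Hypothetical stand-in for common storage or other referenced storage.
blob_store = {}


def publish(buffer, payload, max_inline, key):
    """Place the payload, or a reference pointer to it, in the buffer."""
    if len(payload) <= max_inline:
        buffer.append({"kind": "inline", "data": payload})
    else:
        # The chunk is too large for the shard: store it elsewhere and
        # enqueue only a small reference pointer to its location.
        blob_store[key] = payload
        buffer.append({"kind": "ref", "key": key})


def consume(message):
    """Return the payload, following a reference pointer if present."""
    if message["kind"] == "ref":
        return blob_store[message["key"]]
    return message["data"]


shard_buffer = []
small = b"x" * 10
large = b"y" * 10_000
publish(shard_buffer, small, max_inline=1_024, key="chunk-0")
publish(shard_buffer, large, max_inline=1_024, key="chunk-1")
received = [consume(m) for m in shard_buffer]
```

Here the second message carries only a key, so the amount of data effectively conveyed through the shard can exceed what the shard itself could carry inline.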

As the indexer 410 processes the data, stores the data in buckets, andgenerates indexes of the data, the partition manager 408 can monitor theindexer 410 and the size of the data on the indexer 410 (inclusive ofthe data store 412) associated with the partition. The size of the dataon the indexer 410 can correspond to the data that is actually receivedfrom the particular partition of the intake system 210, as well as datagenerated by the indexer 410 based on the received data (e.g., invertedindexes, summaries, etc.), and may correspond to one or more buckets.For instance, the indexer 410 may have generated one or more buckets foreach tenant and/or partition associated with data being processed in theindexer 410.

Based on a bucket roll-over policy, the partition manager 408 caninstruct the indexer 410 to convert editable groups of data or bucketsto non-editable groups or buckets and/or copy the data associated withthe partition to common storage 216. In some embodiments, the bucketroll-over policy can indicate that the data associated with theparticular partition, which may have been indexed by the indexer 410 andstored in the data store 412 in various buckets, is to be copied tocommon storage 216 based on a determination that the size of the dataassociated with the particular partition satisfies a threshold size. Insome cases, the bucket roll-over policy can include different thresholdsizes for different partitions. In other implementations the bucketroll-over policy may be modified by other factors, such as an identityof a tenant associated with indexing node 404, system resource usage,which could be based on the pod or other container that containsindexing node 404, or one of the physical hardware layers with which theindexing node 404 is running, or any other appropriate factor forscaling and system performance of indexing nodes 404 or any other systemcomponent.

In certain embodiments, the bucket roll-over policy can indicate data isto be copied to common storage 216 based on a determination that theamount of data associated with all partitions (or a subset thereof) ofthe indexing node 404 satisfies a threshold amount. Further, the bucketroll-over policy can indicate that the one or more partition managers408 of an indexing node 404 are to communicate with each other or withthe indexing node manager 406 to monitor the amount of data on theindexer 410 associated with all of the partitions (or a subset thereof)assigned to the indexing node 404 and determine that the amount of dataon the indexer 410 (or data store 412) associated with all thepartitions (or a subset thereof) satisfies a threshold amount.Accordingly, based on the bucket roll-over policy, one or more of thepartition managers 408 or the indexing node manager 406 can instruct theindexer 410 to convert editable buckets associated with the partitions(or subsets thereof) to non-editable buckets and/or store the dataassociated with the partitions (or subset thereof) in common storage216.

In certain embodiments, the bucket roll-over policy can indicate thatbuckets are to be converted to non-editable buckets and stored in commonstorage based on a collective size of buckets satisfying a thresholdsize. In some cases, the bucket roll-over policy can use differentthreshold sizes for conversion and storage. For example, the bucketroll-over policy can use a first threshold size to indicate wheneditable buckets are to be converted to non-editable buckets (e.g., stopwriting to the buckets) and a second threshold size to indicate when thedata (or buckets) are to be stored in common storage 216. In certaincases, the bucket roll-over policy can indicate that the partitionmanager(s) 408 are to send a single command to the indexer 410 thatcauses the indexer 410 to convert editable buckets to non-editablebuckets and store the buckets in common storage 216.
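As a non-limiting illustration, a bucket roll-over policy with separate thresholds for conversion and storage might be sketched as follows. The class name, the action labels, the byte values, and the choice of "greater than or equal to" as the meaning of satisfying a threshold are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RollOverPolicy:
    """Hypothetical policy with one threshold per action (sizes in bytes)."""
    convert_threshold: int  # editable -> non-editable (stop writing)
    store_threshold: int    # copy to common storage

    def actions(self, collective_size):
        """Return the actions triggered by the collective bucket size."""
        triggered = []
        if collective_size >= self.convert_threshold:
            triggered.append("convert_to_non_editable")
        if collective_size >= self.store_threshold:
            triggered.append("copy_to_common_storage")
        return triggered


policy = RollOverPolicy(convert_threshold=500_000_000,
                        store_threshold=750_000_000)
below = policy.actions(100_000_000)
between = policy.actions(600_000_000)
above = policy.actions(800_000_000)
```

Setting both thresholds to the same value corresponds to the case in which a single command causes the indexer to convert the buckets and store them in common storage.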

Based on an acknowledgement that the data associated with a partition(or multiple partitions as the case may be) has been stored in commonstorage 216, the partition manager 408 can communicate to the intakesystem 210, either directly, or through the indexing node manager 406,that the data has been stored and/or that the location marker or readpointer can be moved or updated. In some cases, the partition manager408 receives the acknowledgement that the data has been stored fromcommon storage 216 and/or from the indexer 410. In certain embodiments,which will be described in more detail herein, the intake system 210does not receive communication that the data stored in intake system 210has been read and processed until after that data has been stored incommon storage 216.

The acknowledgement that the data has been stored in common storage 216can also include location information about the data within the commonstorage 216. For example, the acknowledgement can provide a link, map,or path to the copied data in the common storage 216. Using theinformation about the data stored in common storage 216, the partitionmanager 408 can update the data store catalog 220. For example, thepartition manager 408 can update the data store catalog 220 with anidentifier of the data (e.g., bucket identifier, tenant identifier,partition identifier, etc.), the location of the data in common storage216, a time range associated with the data, etc. In this way, the datastore catalog 220 can be kept up-to-date with the contents of the commonstorage 216.

Moreover, as additional data is received from the intake system 210, the partition manager 408 can continue to communicate the data to the indexer 410, monitor the size or amount of data on the indexer 410, instruct the indexer 410 to copy the data to common storage 216, communicate the successful storage of the data to the intake system 210, and update the data store catalog 220.

As a non-limiting example, consider the scenario in which the intake system 210 communicates data from a particular shard or partition to the indexing system 212. The intake system 210 can track which data it has sent and a location marker for the data in the intake system 210 (e.g., a marker that identifies data that has been sent to the indexing system 212 for processing).

As described herein, the intake system 210 can retain or persistentlymake available the sent data until the intake system 210 receives anacknowledgement from the indexing system 212 that the sent data has beenprocessed, stored in persistent storage (e.g., common storage 216), oris safe to be deleted. In this way, if an indexing node 404 assigned toprocess the sent data becomes unresponsive or is lost, e.g., due to ahardware failure or a crash of the indexing node manager 406 or othercomponent, process, or daemon, the data that was sent to theunresponsive indexing node 404 will not be lost. Rather, a differentindexing node 404 can obtain and process the data from the intake system210.

As the indexing system 212 stores the data in common storage 216, it canreport the storage to the intake system 210. In response, the intakesystem 210 can update its marker to identify different data that hasbeen sent to the indexing system 212 for processing, but has not yetbeen stored. By moving the marker, the intake system 210 can indicatethat the previously-identified data has been stored in common storage216, can be deleted from the intake system 210 or, otherwise, can beallowed to be overwritten, lost, etc.

With reference to the example above, in some embodiments, the indexing node manager 406 can track the marker used by the ingestion buffer 310, and the partition manager 408 can receive the data from the ingestion buffer 310 and forward it to an indexer 410 for processing (or use the data in the ingestion buffer to obtain data from a referenced storage location and forward the obtained data to the indexer). The partition manager 408 can monitor the amount of data being processed and instruct the indexer 410 to copy the data to common storage 216. Once the data is stored in common storage 216, the partition manager 408 can report the storage to the ingestion buffer 310, so that the ingestion buffer 310 can update its marker. In addition, the indexing node manager 406 can update its records with the location of the updated marker. In this way, if the partition manager 408 becomes unresponsive or fails, the indexing node manager 406 can assign a different partition manager 408 to obtain the data from the data stream without losing the location information, or if the indexer 410 becomes unavailable or fails, the indexing node manager 406 can assign a different indexer 410 to process and store the data.
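As a non-limiting illustration, the marker handling described above might be sketched as follows. The classes `IngestionShard` and `PartitionManager`, their methods, and the simulated failure flag are hypothetical simplifications: a list stands in for common storage, and the marker advances only after data has been stored.

```python
class IngestionShard:
    """Hypothetical shard that retains records until storage is acknowledged."""

    def __init__(self, records):
        self.records = list(records)
        self.marker = 0  # index of the first record not yet stored

    def unacknowledged(self):
        return self.records[self.marker:]

    def acknowledge(self, count):
        self.marker = min(self.marker + count, len(self.records))


class PartitionManager:
    """Hypothetical manager that stores a batch and then reports the storage."""

    def __init__(self, shard, common_storage, crash_before_store=False):
        self.shard = shard
        self.common_storage = common_storage
        self.crash_before_store = crash_before_store

    def run_once(self, batch_size):
        batch = self.shard.unacknowledged()[:batch_size]
        if self.crash_before_store:
            raise RuntimeError("partition manager became unresponsive")
        self.common_storage.extend(batch)   # data is now persistent
        self.shard.acknowledge(len(batch))  # only then move the marker
        return len(batch)


shard = IngestionShard(["e1", "e2", "e3", "e4", "e5"])
common_storage = []

PartitionManager(shard, common_storage).run_once(batch_size=2)
try:
    PartitionManager(shard, common_storage,
                     crash_before_store=True).run_once(batch_size=2)
except RuntimeError:
    pass  # the marker was not moved, so nothing is lost
marker_after_crash = shard.marker

# A replacement partition manager resumes from the retained marker.
replacement = PartitionManager(shard, common_storage)
while replacement.run_once(batch_size=2):
    pass
```

Because the acknowledgement follows the write to storage, the failed manager in this sketch leaves the marker untouched and the replacement reads the same records again.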

3.2.2.3. Indexer and Data Store

As described herein, the indexer 410 can be the primary indexing execution engine, and can be implemented as a distinct computing device, container, container within a pod, etc. For example, the indexer 410 can be tasked with parsing, processing, indexing, and storing the data received from the intake system 210 via the partition manager(s) 408. Specifically, in some embodiments, the indexer 410 can parse the incoming data to identify timestamps, generate events from the incoming data, group and save events into buckets, generate summaries or indexes (e.g., time series index, inverted index, keyword index, etc.) of the events in the buckets, and store the buckets in common storage 216.

In some cases, one indexer 410 can be assigned to each partition manager 408, and in certain embodiments, one indexer 410 can receive and process the data from multiple (or all) partition managers 408 on the same indexing node 404 or from multiple indexing nodes 404.

In some embodiments, the indexer 410 can store the events and buckets inthe data store 412 according to a bucket creation policy. The bucketcreation policy can indicate how many buckets the indexer 410 is togenerate for the data that it processes. In some cases, based on thebucket creation policy, the indexer 410 generates at least one bucketfor each tenant and index (also referred to as a partition) associatedwith the data that it processes. For example, if the indexer 410receives data associated with three tenants A, B, C, each with twoindexes X, Y, then the indexer 410 can generate at least six buckets: atleast one bucket for each of Tenant A::Index X, Tenant A::Index Y,Tenant B::Index X, Tenant B::Index Y, Tenant C::Index X, and TenantC::Index Y. Additional buckets may be generated for a tenant/partitionpair based on the amount of data received that is associated with thetenant/partition pair. However, it will be understood that the indexer410 can generate buckets using a variety of policies. For example, theindexer 410 can generate one or more buckets for each tenant, partition,source, sourcetype, etc.
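As a non-limiting illustration, a bucket creation policy that generates at least one bucket per tenant and index pair might be sketched as follows. The function name and the event fields `tenant`, `index`, and `raw` are hypothetical.

```python
from collections import defaultdict


def assign_buckets(events):
    """Group events into one bucket per (tenant, index) pair."""
    buckets = defaultdict(list)
    for event in events:
        buckets[(event["tenant"], event["index"])].append(event)
    return dict(buckets)


# Three tenants A, B, C, each with two indexes X, Y.
events = [{"tenant": t, "index": i, "raw": f"data for {t}::{i}"}
          for t in "ABC" for i in "XY"]
buckets = assign_buckets(events)
bucket_names = sorted(f"Tenant {t}::Index {i}" for t, i in buckets)
```

The grouping key could be extended with source or sourcetype to reflect other bucket creation policies.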

In some cases, if the indexer 410 receives data that it determines to be“old,” e.g., based on a timestamp of the data or other temporaldetermination regarding the data, then it can generate a bucket for the“old” data. In some embodiments, the indexer 410 can determine that datais “old,” if the data is associated with a timestamp that is earlier intime by a threshold amount than timestamps of other data in thecorresponding bucket (e.g., depending on the bucket creation policy,data from the same partition and/or tenant) being processed by theindexer 410. For example, if the indexer 410 is processing data for thebucket for Tenant A::Index X having timestamps on 4/23 between 16:23:56and 16:46:32 and receives data for the Tenant A::Index X bucket having atimestamp on 4/22 or on 4/23 at 08:05:32, then it can determine that thedata with the earlier timestamps is “old” data and generate a new bucketfor that data. In this way, the indexer 410 can avoid placing data inthe same bucket that creates a time range that is significantly largerthan the time range of other buckets, which can decrease the performanceof the system as the bucket could be identified as relevant for a searchmore often than it otherwise would.

The threshold amount of time used to determine if received data is “old,” can be predetermined or dynamically determined based on a number of factors, such as, but not limited to, time ranges of other buckets, amount of data being processed, timestamps of the data being processed, etc. For example, the indexer 410 can determine an average time range of buckets that it processes for different tenants and indexes. If incoming data would cause the time range of a bucket to be significantly larger (e.g., 25%, 50%, 75%, double, or other amount) than the average time range, then the indexer 410 can determine that the data is “old” data, and generate a separate bucket for it. By placing the “old” data in a separate bucket, the indexer 410 can reduce the instances in which the bucket is identified as storing data that may be relevant to a query. For example, because the bucket has a smaller time range, the query system 214 may identify the bucket less frequently as a relevant bucket than if the bucket had the large time range due to the “old” data. Additionally, in a process that will be described in more detail herein, time-restricted searches and search queries may be executed more quickly because there may be fewer buckets to search for a particular time range. In this manner, computational efficiency of searching large amounts of data can be improved. Although described with respect to detecting “old” data, the indexer 410 can use similar techniques to determine that “new” data should be placed in a new bucket or that a time gap between data in a bucket and “new” data is larger than a threshold amount such that the “new” data should be stored in a separate bucket.
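As a non-limiting illustration, the determination that data is “old” might be sketched as follows, using the Tenant A::Index X timestamps from the example above. The function name, the one-hour threshold, and the year 2020 are assumptions made for this sketch; a dynamically determined threshold could be substituted.

```python
from datetime import datetime, timedelta


def is_old(event_time, bucket_earliest, threshold):
    """Return True if the event precedes the bucket's earliest timestamp
    by more than the threshold amount."""
    return bucket_earliest - event_time > threshold


# Bucket currently holds data from 4/23 between 16:23:56 and 16:46:32.
bucket_earliest = datetime(2020, 4, 23, 16, 23, 56)
threshold = timedelta(hours=1)  # hypothetical, predetermined threshold

same_day_morning = is_old(datetime(2020, 4, 23, 8, 5, 32),
                          bucket_earliest, threshold)
previous_day = is_old(datetime(2020, 4, 22, 12, 0, 0),
                      bucket_earliest, threshold)
slightly_earlier = is_old(datetime(2020, 4, 23, 16, 20, 0),
                          bucket_earliest, threshold)
```

With this threshold, the data from 08:05:32 and from the previous day would go to a new bucket, while data from a few minutes before the bucket's earliest timestamp would stay in the existing bucket.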

Once a particular bucket satisfies a size threshold, the indexer 410 can store the bucket in or copy the bucket to common storage 216. In certain embodiments, the partition manager 408 can monitor the size of the buckets and instruct the indexer 410 to copy the bucket to common storage 216. The threshold size can be predetermined or dynamically determined.

In certain embodiments, the partition manager 408 can monitor the size of multiple, or all, buckets associated with the partition being managed by the partition manager 408, and based on the collective size of the buckets satisfying a threshold size, instruct the indexer 410 to copy the buckets associated with the partition to common storage 216. In certain cases, one or more partition managers 408 or the indexing node manager 406 can monitor the size of buckets across multiple, or all, partitions associated with the indexing node 404, and instruct the indexer to copy the buckets to common storage 216 based on the size of the buckets satisfying a threshold size.

As described herein, buckets in the data store 412 that are being editedby the indexer 410 can be referred to as hot buckets or editablebuckets. For example, the indexer 410 can add data, events, and indexesto editable buckets in the data store 412, etc. Buckets in the datastore 412 that are no longer edited by the indexer 410 can be referredto as warm buckets or non-editable buckets. In some embodiments, oncethe indexer 410 determines that a hot bucket is to be copied to commonstorage 216, it can convert the hot (editable) bucket to a warm(non-editable) bucket, and then move or copy the warm bucket to thecommon storage 216. Once the warm bucket is moved or copied to commonstorage 216, the indexer 410 can notify the partition manager 408 thatthe data associated with the warm bucket has been processed and stored.As mentioned, the partition manager 408 can relay the information to theintake system 210. In addition, the indexer 410 can provide thepartition manager 408 with information about the buckets stored incommon storage 216, such as, but not limited to, location information,tenant identifier, index identifier, time range, etc. As describedherein, the partition manager 408 can use this information to update thedata store catalog 220.

3.2.3. Bucket Manager

The bucket manager 414 can manage the buckets stored in the data store 412, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container. In some cases, the bucket manager 414 can be implemented as part of the indexer 410, indexing node 404, or as a separate component of the indexing system 212.

As described herein, the indexer 410 stores data in the data store 412as one or more buckets associated with different tenants, indexes, etc.In some cases, the contents of the buckets are not searchable by thequery system 214 until they are stored in common storage 216. Forexample, the query system 214 may be unable to identify data responsiveto a query that is located in hot (editable) buckets in the data store412 and/or the warm (non-editable) buckets in the data store 412 thathave not been copied to common storage 216. Thus, query results may beincomplete or inaccurate, or slowed as the data in the buckets of thedata store 412 are copied to common storage 216.

To decrease the delay between processing and/or indexing the data andmaking that data searchable, the indexing system 212 can use a bucketroll-over policy that instructs the indexer 410 to convert hot bucketsto warm buckets more frequently (or convert based on a smaller thresholdsize) and/or copy the warm buckets to common storage 216. Whileconverting hot buckets to warm buckets more frequently or based on asmaller storage size can decrease the lag between processing the dataand making it searchable, it can increase the storage size and overheadof buckets in common storage 216. For example, each bucket may haveoverhead associated with it, in terms of storage space required,processor power required, or other resource requirement. Thus, morebuckets in common storage 216 can result in more storage used foroverhead than for storing data, which can lead to increased storage sizeand costs. In addition, a larger number of buckets in common storage 216can increase query times, as the opening of each bucket as part of aquery can have certain processing overhead or time delay associated withit.

To decrease search times and reduce overhead and storage associated with the buckets (while maintaining a reduced delay between processing the data and making it searchable), the bucket manager 414 can monitor the buckets stored in the data store 412 and/or common storage 216 and merge buckets according to a bucket merge policy. For example, the bucket manager 414 can monitor and merge warm buckets stored in the data store 412 before, after, or concurrently with the indexer copying warm buckets to common storage 216.

The bucket merge policy can indicate which buckets are candidates for amerge or which bucket to merge (e.g., based on time ranges, size,tenant/partition or other identifiers), the number of buckets to merge,size or time range parameters for the merged buckets, and/or a frequencyfor creating the merged buckets. For example, the bucket merge policycan indicate that a certain number of buckets are to be merged,regardless of size of the buckets. As another non-limiting example, thebucket merge policy can indicate that multiple buckets are to be mergeduntil a threshold bucket size is reached (e.g., 750 MB, or 1 GB, ormore). As yet another non-limiting example, the bucket merge policy canindicate that buckets having a time range within a set period of time(e.g., 30 sec, 1 min., etc.) are to be merged, regardless of the numberor size of the buckets being merged.

In addition, the bucket merge policy can indicate which buckets are tobe merged or include additional criteria for merging buckets. Forexample, the bucket merge policy can indicate that only buckets havingthe same tenant identifier and/or partition are to be merged, or setconstraints on the size of the time range for a merged bucket (e.g., thetime range of the merged bucket is not to exceed an average time rangeof buckets associated with the same source, tenant, partition, etc.). Incertain embodiments, the bucket merge policy can indicate that bucketsthat are older than a threshold amount (e.g., one hour, one day, etc.)are candidates for a merge or that a bucket merge is to take place oncean hour, once a day, etc. In certain embodiments, the bucket mergepolicy can indicate that buckets are to be merged based on adetermination that the number or size of warm buckets in the data store412 of the indexing node 404 satisfies a threshold number or size, orthe number or size of warm buckets associated with the same tenantidentifier and/or partition satisfies the threshold number or size. Itwill be understood, that the bucket manager 414 can use any one or anycombination of the aforementioned or other criteria for the bucket mergepolicy to determine when, how, and which buckets to merge.
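As a non-limiting illustration, one possible bucket merge policy might be sketched as follows: only buckets with the same tenant and partition are merged, and buckets are combined in time order while the merged size stays within a threshold. The function name, the bucket fields, the sizes (in MB), and the decision not to exceed the threshold are assumptions made for this sketch.

```python
from itertools import groupby


def plan_merges(buckets, max_merged_size):
    """Return lists of bucket ids to merge, per (tenant, partition) pair."""
    def pair(bucket):
        return (bucket["tenant"], bucket["partition"])

    ordered = sorted(buckets, key=lambda b: (pair(b), b["start"]))
    plans = []
    for _, group in groupby(ordered, key=pair):
        current, size = [], 0
        for bucket in group:
            # Close the current merge group if adding this bucket would
            # push the merged bucket past the size threshold.
            if current and size + bucket["size"] > max_merged_size:
                plans.append(current)
                current, size = [], 0
            current.append(bucket["id"])
            size += bucket["size"]
        if current:
            plans.append(current)
    # A group of one bucket has nothing to merge with.
    return [plan for plan in plans if len(plan) > 1]


warm_buckets = [
    {"id": "b1", "tenant": "A", "partition": "main", "start": 0, "size": 300},
    {"id": "b2", "tenant": "A", "partition": "main", "start": 60, "size": 300},
    {"id": "b3", "tenant": "A", "partition": "main", "start": 120, "size": 300},
    {"id": "b4", "tenant": "B", "partition": "main", "start": 0, "size": 100},
    {"id": "b5", "tenant": "A", "partition": "main", "start": 180, "size": 200},
]
merge_plan = plan_merges(warm_buckets, max_merged_size=750)
```

Other criteria named above, such as bucket age, bucket count, or time range limits, could be added as further conditions in the inner loop.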

Once a group of buckets is merged into one or more merged buckets, the bucket manager 414 can copy or instruct the indexer 406 to copy the merged buckets to common storage 216. Based on a determination that the merged buckets are successfully copied to the common storage 216, the bucket manager 414 can delete the merged buckets and the buckets used to generate the merged buckets (also referred to herein as unmerged buckets or pre-merged buckets) from the data store 412.

In some cases, the bucket manager 414 can also remove or instruct the common storage 216 to remove corresponding pre-merged buckets from the common storage 216 according to a bucket management policy. The bucket management policy can indicate when the pre-merged buckets are to be deleted or designated as able to be overwritten from common storage 216.

In some cases, the bucket management policy can indicate that the pre-merged buckets are to be deleted immediately, once any queries relying on the pre-merged buckets are completed, after a predetermined amount of time, etc. In some cases, the pre-merged buckets may be in use or identified for use by one or more queries. Removing the pre-merged buckets from common storage 216 in the middle of a query may cause one or more failures in the query system 214 or result in query responses that are incomplete or erroneous. Accordingly, the bucket management policy, in some cases, can indicate to the common storage 216 that queries that arrive before a merged bucket is stored in common storage 216 are to use the corresponding pre-merged buckets and queries that arrive after the merged bucket is stored in common storage 216 are to use the merged bucket.

Further, the bucket management policy can indicate that once queries using the pre-merged buckets are completed, the buckets are to be removed from common storage 216. However, it will be understood that the bucket management policy can indicate removal of the buckets in a variety of ways. For example, per the bucket management policy, the common storage 216 can remove the buckets after one or more hours, one day, one week, etc., with or without regard to queries that may be relying on the pre-merged buckets. In some embodiments, the bucket management policy can indicate that the pre-merged buckets are to be removed without regard to queries relying on the pre-merged buckets and that any queries relying on the pre-merged buckets are to be redirected to the merged bucket.

In addition to removing the pre-merged buckets and merged bucket from the data store 412 and removing or instructing common storage 216 to remove the pre-merged buckets from the data store(s) 218, the bucket manager 414 can update the data store catalog 220 or cause the indexer 410 or partition manager 408 to update the data store catalog 220 with the relevant changes. These changes can include removing reference to the pre-merged buckets in the data store catalog 220 and/or adding information about the merged bucket, including, but not limited to, a bucket, tenant, and/or partition identifier associated with the merged bucket, a time range of the merged bucket, location information of the merged bucket in common storage 216, etc. In this way, the data store catalog 220 can be kept up-to-date with the contents of the common storage 216.

3.3. Query System

FIG. 5 is a block diagram illustrating an embodiment of a query system 214 of the data intake and query system 108. The query system 214 can receive, process, and execute queries from multiple client devices 204, which may be associated with different tenants, users, etc. Moreover, the query system 214 can include various components that enable it to provide a stateless or state-free search service, or search service that is able to rapidly recover without data loss if one or more components of the query system 214 become unresponsive or unavailable.

In the illustrated embodiment, the query system 214 includes one or more query system managers 502 (collectively or individually referred to as query system manager 502), one or more search heads 504 (collectively or individually referred to as search head 504 or search heads 504), one or more search nodes 506 (collectively or individually referred to as search node 506 or search nodes 506), a search node monitor 508, and a search node catalog 510. However, it will be understood that the query system 214 can include fewer or more components as desired. For example, in some embodiments, the common storage 216, data store catalog 220, or query acceleration data store 222 can form part of the query system 214, etc.

As described herein, each of the components of the query system 214 can be implemented using one or more computing devices as distinct computing devices or as one or more container instances or virtual machines across one or more computing devices. For example, in some embodiments, the query system manager 502, search heads 504, and search nodes 506 can be implemented as distinct computing devices with separate hardware, memory, and processors. In certain embodiments, the query system manager 502, search heads 504, and search nodes 506 can be implemented on the same or across different computing devices as distinct container instances, with each container having access to a subset of the resources of a host computing device (e.g., a subset of the memory or processing time of the processors of the host computing device), but sharing a similar operating system. In some cases, the components can be implemented as distinct virtual machines across one or more computing devices, where each virtual machine can have its own unshared operating system but shares the underlying hardware with other virtual machines on the same host computing device.

3.3.1. Query System Manager

As mentioned, the query system manager 502 can monitor and manage the search heads 504 and search nodes 506, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container. For example, the query system manager 502 can determine which search head 504 is to handle an incoming query or determine whether to generate an additional search node 506 based on the number of queries received by the query system 214 or based on another search node 506 becoming unavailable or unresponsive. Similarly, the query system manager 502 can determine that additional search heads 504 should be generated to handle an influx of queries or that some search heads 504 can be de-allocated or terminated based on a reduction in the number of queries received.

In certain embodiments, the query system 214 can include one query system manager 502 to manage all search heads 504 and search nodes 506 of the query system 214. In some embodiments, the query system 214 can include multiple query system managers 502. For example, a query system manager 502 can be instantiated for each computing device (or group of computing devices) configured as a host computing device for multiple search heads 504 and/or search nodes 506.

Moreover, the query system manager 502 can handle resource management, creation, assignment, or destruction of search heads 504 and/or search nodes 506, high availability, load balancing, application upgrades/rollbacks, logging and monitoring, storage, networking, service discovery, and performance and scalability, and otherwise handle containerization management of the containers of the query system 214. In certain embodiments, the query system manager 502 can be implemented using Kubernetes or Swarm. For example, in certain embodiments, the query system manager 502 may be part of a sidecar or sidecar container that allows communication between various search nodes 506, various search heads 504, and/or combinations thereof.

In some cases, the query system manager 502 can monitor the available resources of a host computing device and/or request additional resources in a shared resource environment, based on workload of the search heads 504 and/or search nodes 506, or create, destroy, or reassign search heads 504 and/or search nodes 506 based on workload. Further, the query system manager 502 can assign search heads 504 to handle incoming queries and/or assign search nodes 506 to handle query processing based on workload, system resources, etc.

3.3.2. Search Head

As described herein, the search heads 504 can manage the execution of queries received by the query system 214. For example, the search heads 504 can parse the queries to identify the set of data to be processed and the manner of processing the set of data, identify the location of the data, identify tasks to be performed by the search head and tasks to be performed by the search nodes 506, distribute the query (or sub-queries corresponding to the query) to the search nodes 506, apply extraction rules to the set of data to be processed, aggregate search results from the search nodes 506, store the search results in the query acceleration data store 222, etc.

As described herein, the search heads 504 can be implemented on separate computing devices or as containers or virtual machines in a virtualization environment. In some embodiments, the search heads 504 may be implemented using multiple related containers. In certain embodiments, such as in a Kubernetes deployment, each search head 504 can be implemented as a separate container or pod. For example, one or more of the components of the search head 504 can be implemented as different containers of a single pod. On a containerization platform such as Docker, for instance, the one or more components of the search head 504 can be implemented as different Docker containers managed by synchronization platforms such as Kubernetes or Swarm. Accordingly, reference to a containerized search head 504 can refer to the search head 504 as being a single container or as one or more components of the search head 504 being implemented as different, related containers.

In the illustrated embodiment, the search head 504 includes a search master 512 and one or more search managers 514 to carry out its various functions. However, it will be understood that the search head 504 can include fewer or more components as desired. For example, the search head 504 can include multiple search masters 512.

3.3.2.1. Search Master

The search master 512 can manage the execution of the various queries assigned to the search head 504, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container. For example, in certain embodiments, as the search head 504 is assigned a query, the search master 512 can generate one or more search manager(s) 514 to manage the query. In some cases, the search master 512 generates a separate search manager 514 for each query that is received by the search head 504. In addition, once a query is completed, the search master 512 can handle the termination of the corresponding search manager 514.

In certain embodiments, the search master 512 can track and store the queries assigned to the different search managers 514. Accordingly, if a search manager 514 becomes unavailable or unresponsive, the search master 512 can generate a new search manager 514 and assign the query to the new search manager 514. In this way, the search head 504 can increase the resiliency of the query system 214, reduce delay caused by an unresponsive component, and aid in providing a stateless searching service.

In some embodiments, the search master 512 is implemented as a background process, or daemon, on the search head 504 and the search manager(s) 514 are implemented as threads, copies, or forks of the background process. In some cases, a search master 512 can copy itself, or fork, to create a search manager 514 or cause a template process to copy itself, or fork, to create each new search manager 514, etc., in order to support efficient multithreaded implementations.

3.3.2.2. Search Manager

As mentioned, the search managers 514 can manage the processing and execution of the queries assigned to the search head 504, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container. In some embodiments, one search manager 514 manages the processing and execution of one query at a time. In such embodiments, if the search head 504 is processing one hundred queries, the search master 512 can generate one hundred search managers 514 to manage the one hundred queries. Upon completing an assigned query, the search manager 514 can await assignment to a new query or be terminated.

As part of managing the processing and execution of a query, and as described herein, a search manager 514 can parse the query to identify the set of data and the manner in which the set of data is to be processed (e.g., the transformations that are to be applied to the set of data), determine tasks to be performed by the search manager 514 and tasks to be performed by the search nodes 506, identify search nodes 506 that are available to execute the query, map search nodes 506 to the set of data that is to be processed, instruct the search nodes 506 to execute the query and return results, aggregate and/or transform the search results from the various search nodes 506, and provide the search results to a user and/or to the query acceleration data store 222.

In some cases, to aid in identifying the set of data to be processed, the search manager 514 can consult the data store catalog 220 (depicted in FIG. 2). As described herein, the data store catalog 220 can include information regarding the data stored in common storage 216. In some cases, the data store catalog 220 can include bucket identifiers, a time range, and a location of the buckets in common storage 216. In addition, the data store catalog 220 can include a tenant identifier and partition identifier for the buckets. This information can be used to identify buckets that include data that satisfies at least a portion of the query.

As a non-limiting example, consider a search manager 514 that has parsed a query to identify the following filter criteria that are used to identify the data to be processed: time range: past hour, partition: _sales, tenant: ABC, Inc., keyword: Error. Using the received filter criteria, the search manager 514 can consult the data store catalog 220. Specifically, the search manager 514 can use the data store catalog 220 to identify buckets associated with the _sales partition and the tenant ABC, Inc. and that include data from the past hour. In some cases, the search manager 514 can obtain bucket identifiers and location information from the data store catalog 220 for the buckets storing data that satisfies at least the aforementioned filter criteria. In certain embodiments, if the data store catalog 220 includes keyword pairs, the search manager 514 can use the keyword: Error to identify buckets that have at least one event that includes the keyword Error.

Using the bucket identifiers and/or the location information, the search manager 514 can assign one or more search nodes 506 to search the corresponding buckets. Accordingly, the data store catalog 220 can be used to identify relevant buckets and reduce the number of buckets that are to be searched by the search nodes 506. In this way, the data store catalog 220 can decrease the query response time of the data intake and query system 108.
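As a non-limiting illustration, the catalog lookup in the preceding example (tenant, partition, and time range) can be sketched as a filter over catalog entries. The CatalogEntry fields and the find_buckets function are hypothetical names chosen for this sketch, and times are assumed to be epoch seconds; the specification does not prescribe a catalog schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical catalog row; the field names are assumptions.
    bucket_id: str
    tenant: str
    partition: str
    start: int      # earliest event time in the bucket (epoch seconds)
    end: int        # latest event time in the bucket (epoch seconds)
    location: str   # where the bucket lives in common storage


def find_buckets(catalog, tenant, partition, query_start, query_end):
    """Return entries whose tenant and partition match and whose time range
    overlaps the query's time range."""
    return [
        entry
        for entry in catalog
        if entry.tenant == tenant
        and entry.partition == partition
        and entry.start <= query_end
        and entry.end >= query_start
    ]


now = 10_000
catalog = [
    CatalogEntry("b1", "ABC, Inc.", "_sales", now - 1_800, now - 600, "store/b1"),
    CatalogEntry("b2", "ABC, Inc.", "_sales", now - 9_000, now - 7_200, "store/b2"),
    CatalogEntry("b3", "ABC, Inc.", "_main", now - 1_800, now - 600, "store/b3"),
    CatalogEntry("b4", "XYZ", "_sales", now - 1_800, now - 600, "store/b4"),
]
# "Past hour" for tenant ABC, Inc. and partition _sales.
matches = find_buckets(catalog, "ABC, Inc.", "_sales", now - 3_600, now)
```

Only the matching bucket identifiers and locations would then be handed to search nodes, which is how the catalog narrows the set of buckets to be searched.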

In some embodiments, the use of the data store catalog 220 to identify buckets for searching can contribute to the statelessness of the query system 214 and search head 504. For example, if a search head 504 or search manager 514 becomes unresponsive or unavailable, the query system manager 502 or search master 512, as the case may be, can spin up or assign an additional resource (new search head 504 or new search manager 514) to execute the query. As the bucket information is persistently stored in the data store catalog 220, data lost due to the unavailability or unresponsiveness of a component of the query system 214 can be recovered by using the bucket information in the data store catalog 220.

In certain embodiments, to identify search nodes 506 that are available to execute the query, the search manager 514 can consult the search node catalog 510. As described herein, the search node catalog 510 can include information regarding the search nodes 506. In some cases, the search node catalog 510 can include an identifier for each search node 506, as well as utilization and availability information. For example, the search node catalog 510 can identify search nodes 506 that are instantiated but are unavailable or unresponsive. In addition, the search node catalog 510 can identify the utilization rate of the search nodes 506. For example, the search node catalog 510 can identify search nodes 506 that are working at maximum capacity or at a utilization rate that satisfies a utilization threshold, such that the search node 506 should not be used to execute additional queries for a time.

In addition, the search node catalog 510 can include architectural information about the search nodes 506. For example, the search node catalog 510 can identify search nodes 506 that share a data store and/or are located on the same computing device, or on computing devices that are co-located.

Accordingly, in some embodiments, based on the receipt of a query, a search manager 514 can consult the search node catalog 510 for search nodes 506 that are available to execute the received query. Based on the consultation of the search node catalog 510, the search manager 514 can determine which search nodes 506 to assign to execute the query.

The search manager 514 can map the search nodes 506 to the data that is to be processed according to a search node mapping policy. The search node mapping policy can indicate how search nodes 506 are to be assigned to data (e.g., buckets) and when search nodes 506 are to be assigned to (and instructed to search) the data or buckets.

In some cases, the search manager 514 can map the search nodes 506 to buckets that include data that satisfies at least a portion of the query. For example, in some cases, the search manager 514 can consult the data store catalog 220 to obtain bucket identifiers of buckets that include data that satisfies at least a portion of the query (as a non-limiting example, bucket identifiers of buckets that include data associated with a particular time range). Based on the identified buckets and search nodes 506, the search manager 514 can dynamically assign (or map) search nodes 506 to individual buckets according to a search node mapping policy.

In some embodiments, the search node mapping policy can indicate that the search manager 514 is to assign all buckets to search nodes 506 as a single operation. For example, where ten buckets are to be searched by five search nodes 506, the search manager 514 can assign two buckets to a first search node 506, two buckets to a second search node 506, etc. In another embodiment, the search node mapping policy can indicate that the search manager 514 is to assign buckets iteratively. For example, where ten buckets are to be searched by five search nodes 506, the search manager 514 can initially assign five buckets (e.g., one bucket to each search node 506), and assign additional buckets to each search node 506 as the respective search nodes 506 complete the execution on the assigned buckets.
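As a non-limiting illustration, the two assignment styles in the ten-bucket, five-node example can be sketched as follows. The function names and the completion_order parameter (a simulated record of which node finishes next) are assumptions made for this sketch and do not appear in this specification.

```python
from collections import deque


def assign_all_at_once(buckets, nodes):
    """Single operation: spread every bucket across the nodes up front."""
    assignment = {node: [] for node in nodes}
    for index, bucket in enumerate(buckets):
        assignment[nodes[index % len(nodes)]].append(bucket)
    return assignment


def assign_iteratively(buckets, nodes, completion_order):
    """Iterative: one bucket per node first, then one more bucket to each
    node as it reports completion (simulated by completion_order)."""
    pending = deque(buckets)
    assignment = {node: [] for node in nodes}
    for node in nodes:
        if pending:
            assignment[node].append(pending.popleft())
    for node in completion_order:
        if not pending:
            break
        assignment[node].append(pending.popleft())
    return assignment


buckets = [f"bucket-{i}" for i in range(10)]
nodes = ["sn-1", "sn-2", "sn-3", "sn-4", "sn-5"]

static_plan = assign_all_at_once(buckets, nodes)
# Here sn-1 finishes quickly four times and sn-2 once, so they receive more work.
dynamic_plan = assign_iteratively(
    buckets, nodes, ["sn-1", "sn-1", "sn-1", "sn-2", "sn-1"]
)
```

The iterative style lets faster search nodes absorb more buckets, at the cost of an extra round trip each time a node completes its current bucket.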

Retrieving buckets from common storage 216 to be searched by the search nodes 506 can cause delay or may use a relatively high amount of network bandwidth or disk read/write bandwidth. In some cases, a local or shared data store associated with the search nodes 506 may include a copy of a bucket that was previously retrieved from common storage 216. Accordingly, to reduce delay caused by retrieving buckets from common storage 216, the search node mapping policy can indicate that the search manager 514 is to assign, preferably assign, or attempt to assign the same search node 506 to search the same bucket over time. In this way, the assigned search node 506 can keep a local copy of the bucket on its data store (or a data store shared between multiple search nodes 506) and avoid the processing delays associated with obtaining the bucket from the common storage 216.

In certain embodiments, the search node mapping policy can indicate that the search manager 514 is to use a consistent hash function or other function to consistently map a bucket to a particular search node 506. The search manager 514 can perform the hash using the bucket identifier obtained from the data store catalog 220, and the output of the hash can be used to identify the search node 506 assigned to the bucket. In some cases, the consistent hash function can be configured such that even with a different number of search nodes 506 being assigned to execute the query, the output will consistently identify the same search node 506, or have an increased probability of identifying the same search node 506.
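As a non-limiting illustration, one function with the stability property described above is rendezvous (highest-random-weight) hashing, shown in the sketch below. This specification does not name a particular hash scheme, so the choice of rendezvous hashing, the use of SHA-256, and the names pick_node and _score are all assumptions made for illustration.

```python
import hashlib


def _score(bucket_id, node_id):
    """Deterministic pseudo-random weight for a (bucket, node) pair."""
    digest = hashlib.sha256(f"{bucket_id}|{node_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(bucket_id, node_ids):
    """Rendezvous hashing: the node with the highest score for this bucket wins.

    Removing a node that is not the winner never changes the result, and
    adding a node changes the result only if the new node outscores the winner.
    """
    if not node_ids:
        raise ValueError("no search nodes available")
    return max(node_ids, key=lambda node_id: _score(bucket_id, node_id))


nodes = ["sn-1", "sn-2", "sn-3", "sn-4", "sn-5"]
owner = pick_node("bucket-42", nodes)

# The same bucket still maps to the same node when other nodes are dropped.
fewer_nodes = [owner] + [n for n in nodes if n != owner][:2]
owner_with_fewer = pick_node("bucket-42", fewer_nodes)
```

Because the bucket identifier alone determines the ranking of nodes, no stored state is needed to repeat the assignment on a later query.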

In some embodiments, the query system 214 can store a mapping of search nodes 506 to bucket identifiers. The search node mapping policy can indicate that the search manager 514 is to use the mapping to determine whether a particular bucket has been assigned to a search node 506. If the bucket has been assigned to a particular search node 506 and that search node 506 is available, then the search manager 514 can assign the bucket to the search node 506. If the bucket has not been assigned to a particular search node 506, the search manager 514 can use a hash function to identify a search node 506 for assignment. Once assigned, the search manager 514 can store the mapping for future use.

In certain cases, the search node mapping policy can indicate that the search manager 514 is to use architectural information about the search nodes 506 to assign buckets. For example, if the identified search node 506 is unavailable or its utilization rate satisfies a threshold utilization rate, the search manager 514 can determine whether an available search node 506 shares a data store with the unavailable search node 506. If it does, the search manager 514 can assign the bucket to the available search node 506 that shares the data store with the unavailable search node 506. In this way, the search manager 514 can reduce the likelihood that the bucket will be obtained from common storage 216, which can introduce additional delay to the query while the bucket is retrieved from common storage 216 to the data store shared by the available search node 506.
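As a non-limiting illustration, the stored mapping, the hash fallback, and the shared-data-store fallback described above can be combined in one lookup. The function and parameter names below are hypothetical, the simple modulo hash is a stand-in for whatever hash function an embodiment uses, and leaving the stored mapping untouched when a peer node is substituted is a design choice of this sketch rather than something this specification requires.

```python
import hashlib


def hash_pick(bucket_id, nodes):
    """Stand-in hash: map a bucket identifier onto a sorted list of nodes."""
    ordered = sorted(nodes)
    if not ordered:
        raise ValueError("no search nodes available")
    digest = hashlib.sha256(bucket_id.encode("utf-8")).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def assign_bucket(bucket_id, mapping, available, shared_store):
    """mapping: bucket id -> previously assigned node (updated in place).
    available: set of nodes that are up and under their utilization threshold.
    shared_store: node -> identifier of the data store that node uses."""
    previous = mapping.get(bucket_id)
    if previous in available:
        return previous  # reuse the earlier assignment

    if previous is not None:
        # Earlier node is unavailable: prefer a peer on the same data store,
        # since the bucket may already be cached there.
        store = shared_store.get(previous)
        if store is not None:
            for node in sorted(available):
                if shared_store.get(node) == store:
                    return node

    chosen = hash_pick(bucket_id, available)
    mapping[bucket_id] = chosen  # remember the assignment for future queries
    return chosen


mapping = {"b1": "sn-1"}
shared = {"sn-1": "ds-A", "sn-2": "ds-A", "sn-3": "ds-B"}
first = assign_bucket("b1", mapping, {"sn-1", "sn-2", "sn-3"}, shared)
fallback = assign_bucket("b1", mapping, {"sn-2", "sn-3"}, shared)
fresh = assign_bucket("b2", mapping, {"sn-2", "sn-3"}, shared)
```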

In some instances, the search node mapping policy can indicate that the search manager 514 is to assign buckets to search nodes 506 randomly, or in a simple sequence (e.g., a first search node 506 is assigned a first bucket, a second search node 506 is assigned a second bucket, etc.). In other instances, as discussed, the search node mapping policy can indicate that the search manager 514 is to assign buckets to search nodes 506 based on buckets previously assigned to a search node 506 in a prior or current search. As mentioned above, in some embodiments each search node 506 may be associated with a local data store or cache of information (e.g., in memory of the search nodes 506, such as random access memory [“RAM”], disk-based cache, a data store, or other form of storage). Each search node 506 can store copies of one or more buckets from the common storage 216 within the local cache, such that the buckets may be more rapidly searched by search nodes 506. The search manager 514 (or cache manager 516) can maintain or retrieve from search nodes 506 information identifying, for each relevant search node 506, what buckets are copied within local cache of the respective search nodes 506. In the event that the search manager 514 determines that a search node 506 assigned to execute a search has within its data store or local cache a copy of an identified bucket, the search manager 514 can preferentially assign the search node 506 to search that locally-cached bucket.
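As a non-limiting illustration, preferential assignment to a search node that already holds a cached copy can be sketched as follows. The names are hypothetical, and breaking ties by current load is an assumption of this sketch; the specification leaves the tie-breaking rule open.

```python
def assign_with_cache_preference(buckets, nodes, cached):
    """cached: node -> set of bucket ids already in that node's local cache.

    A bucket goes to a node holding a cached copy when one exists; otherwise
    it goes to the least-loaded node. Ties go to the earlier node in `nodes`.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for bucket in buckets:
        holders = [node for node in nodes if bucket in cached.get(node, set())]
        candidates = holders or nodes
        chosen = min(candidates, key=lambda node: load[node])
        assignment[bucket] = chosen
        load[chosen] += 1
    return assignment


plan = assign_with_cache_preference(
    buckets=["b1", "b2", "b3"],
    nodes=["sn-1", "sn-2"],
    cached={"sn-2": {"b1"}},
)
```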

In still more embodiments, according to the search node mapping policy, search nodes 506 may be assigned based on overlaps of computing resources of the search nodes 506. For example, where a containerized search node 506 is to retrieve a bucket from common storage 216 (e.g., where a local cached copy of the bucket does not exist on the search node 506), such retrieval may use a relatively high amount of network bandwidth or disk read/write bandwidth. Thus, assigning a second containerized search node 506 instantiated on the same host computing device might be expected to strain or exceed the network or disk read/write bandwidth of the host computing device. For this reason, in some embodiments, according to the search node mapping policy, the search manager 514 can assign buckets to search nodes 506 such that two containerized search nodes 506 on a common host computing device do not both retrieve buckets from common storage 216 at the same time.

Further, in certain embodiments, where a data store that is shared between multiple search nodes 506 includes two buckets identified for the search, the search manager 514 can, according to the search node mapping policy, assign both such buckets to the same search node 506 or to two different search nodes 506 that share the data store, such that both buckets can be searched in parallel by the respective search nodes 506.

The search node mapping policy can indicate that the search manager 514 is to use any one or any combination of the above-described mechanisms to assign buckets to search nodes 506. Furthermore, the search node mapping policy can indicate that the search manager 514 is to prioritize assigning search nodes 506 to buckets based on any one or any combination of: assigning search nodes 506 to process buckets that are in a local or shared data store of the search nodes 506, maximizing parallelization (e.g., assigning as many different search nodes 506 to execute the query as are available), assigning search nodes 506 to process buckets with overlapping timestamps, maximizing individual search node 506 utilization (e.g., ensuring that each search node 506 is searching at least one bucket at any given time, etc.), or assigning search nodes 506 to process buckets associated with a particular tenant, user, or other known feature of data stored within the bucket (e.g., buckets holding data known to be used in time-sensitive searches may be prioritized). Thus, according to the search node mapping policy, the search manager 514 can dynamically alter the assignment of buckets to search nodes 506 to increase the parallelization of a search, and to increase the speed and efficiency with which the search is executed.

It will be understood that the search manager 514 can assign any search node 506 to search any bucket. This flexibility can decrease query response time as the search manager 514 can dynamically determine which search nodes 506 are best suited or available to execute the query on different buckets. Further, if one bucket is being used by multiple queries, the search manager 514 can assign multiple search nodes 506 to search the bucket. In addition, in the event a search node 506 becomes unavailable or unresponsive, the search manager 514 can assign a different search node 506 to search the buckets assigned to the unavailable search node 506.

As part of the query execution, the search manager 514 can instruct the search nodes 506 to execute the query (or sub-query) on the assigned buckets. As described herein, the search manager 514 can generate specific queries or sub-queries for the individual search nodes 506. The search nodes 506 can use the queries to execute the query on the buckets assigned thereto.

In some embodiments, the search manager 514 stores the sub-queries and bucket assignments for the different search nodes 506. Storing the sub-queries and bucket assignments can contribute to the statelessness of the query system 214. For example, in the event an assigned search node 506 becomes unresponsive or unavailable during the query execution, the search manager 514 can re-assign the sub-query and bucket assignments of the unavailable search node 506 to one or more available search nodes 506 or identify a different available search node 506 from the search node catalog 510 to execute the sub-query. In certain embodiments, the query system manager 502 can generate an additional search node 506 to execute the sub-query of the unavailable search node 506. Accordingly, the query system 214 can quickly recover from an unavailable or unresponsive component without data loss and while reducing or minimizing delay.

During the query execution, the search manager 514 can monitor the status of the assigned search nodes 506. In some cases, the search manager 514 can ping or set up a communication link between it and the search nodes 506 assigned to execute the query. As mentioned, the search manager 514 can store the mapping of the buckets to the search nodes 506. Accordingly, in the event a particular search node 506 becomes unavailable or is unresponsive, the search manager 514 can assign a different search node 506 to complete the execution of the query for the buckets assigned to the unresponsive search node 506.

In some cases, as part of the status updates to the search manager 514, the search nodes 506 can provide the search manager 514 with partial results and information regarding the buckets that have been searched. In response, the search manager 514 can store the partial results and bucket information in persistent storage. Accordingly, if a search node 506 partially executes the query and becomes unresponsive or unavailable, the search manager 514 can assign a different search node 506 to complete the execution, as described above. For example, the search manager 514 can assign a search node 506 to execute the query on the buckets that were not searched by the unavailable search node 506. In this way, the search manager 514 can more quickly recover from an unavailable or unresponsive search node 506 without data loss and while reducing or minimizing delay.
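As a non-limiting illustration, recovery from a search node that fails partway through its buckets can be sketched as follows. The function name, the round-robin redistribution, and the representation of persisted progress as a set of searched bucket identifiers are assumptions made for this sketch.

```python
def reassign_unsearched(assignments, searched, failed_node, available_nodes):
    """Return a new assignment with the failed node's unfinished buckets
    spread across the available nodes.

    assignments: node -> list of bucket ids assigned for the query.
    searched: set of bucket ids whose partial results were already persisted.
    """
    remaining = [
        bucket
        for bucket in assignments.get(failed_node, [])
        if bucket not in searched
    ]
    if remaining and not available_nodes:
        raise RuntimeError("no search node available to take over")

    updated = {
        node: list(buckets)
        for node, buckets in assignments.items()
        if node != failed_node
    }
    for index, bucket in enumerate(remaining):
        node = available_nodes[index % len(available_nodes)]
        updated.setdefault(node, []).append(bucket)
    return updated


recovered = reassign_unsearched(
    assignments={"sn-1": ["b1", "b2", "b3"], "sn-2": ["b4"]},
    searched={"b1", "b4"},          # partial results already stored
    failed_node="sn-1",
    available_nodes=["sn-2", "sn-3"],
)
```

Bucket b1 is not searched again because its partial results were persisted before the failure; only b2 and b3 are handed to other search nodes.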

As the search manager 514 receives query results from the different search nodes 506, it can process the data. In some cases, the search manager 514 processes the partial results as it receives them. For example, if the query includes a count, the search manager 514 can increment the count as it receives the results from the different search nodes 506. In certain cases, the search manager 514 waits for the complete results from the search nodes 506 before processing them. For example, if the query includes a command that operates on a result set, or a partial result set, such as a stats command (a command that calculates one or more aggregate statistics over the result set, e.g., average, count, or standard deviation), the search manager 514 can wait for the results from all the search nodes 506 before executing the stats command.
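As a non-limiting illustration, the two processing styles described above can be sketched as a streaming counter and a deferred statistic. The class names are hypothetical. The deferred class waits for every expected node before computing a population standard deviation, which mirrors the wait-for-all behavior described above; it is not a claim that such a statistic can only be computed that way.

```python
import statistics


class CountAggregator:
    """Streaming: fold each partial count in as soon as it arrives."""

    def __init__(self):
        self.total = 0

    def on_partial(self, partial_count):
        self.total += partial_count
        return self.total


class DeferredStdev:
    """Deferred: hold partial result sets until every node has reported."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.partials = {}

    def on_partial(self, node_id, values):
        self.partials[node_id] = list(values)

    def result(self):
        if set(self.partials) != self.expected:
            return None  # still waiting on at least one search node
        merged = [v for values in self.partials.values() for v in values]
        return statistics.pstdev(merged)


counter = CountAggregator()
counter.on_partial(3)
counter.on_partial(4)

stdev = DeferredStdev(expected_nodes=["sn-1", "sn-2"])
stdev.on_partial("sn-1", [2, 4])
early = stdev.result()          # None: sn-2 has not reported yet
stdev.on_partial("sn-2", [4, 6])
final = stdev.result()
```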

As the search manager 514 processes the results or completes processing the results, it can store the results in the query acceleration data store 222 or communicate the results to a client device 204. As described herein, results stored in the query acceleration data store 222 can be combined with other results over time. For example, if the query system 214 receives an open-ended query (e.g., no set end time), the search manager 514 can store the query results over time in the query acceleration data store 222. Query results in the query acceleration data store 222 can be updated as additional query results are obtained. In this manner, if an open-ended query is run at time B, query results may be stored from initial time A to time B. If the same open-ended query is run at time C, then the query results from the prior open-ended query can be obtained from the query acceleration data store 222 (which gives the results from time A to time B), and the query can be run from time B to time C and combined with the prior results, rather than running the entire query from time A to time C. In this manner, the computational efficiency of ongoing search queries can be improved.
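As a non-limiting illustration, the time A, time B, and time C example above can be sketched with a dictionary standing in for the query acceleration data store 222. The function run_open_ended, the run_query callback, and the half-open time windows are assumptions made for this sketch.

```python
def run_open_ended(query_id, now, store, run_query, initial_time=0):
    """Run only the uncovered time window and combine with stored results.

    store: query id -> (covered_until, results); stands in for the
           acceleration data store.
    run_query(start, end): returns results with start <= time < end.
    """
    covered_until, results = store.get(query_id, (initial_time, []))
    if now > covered_until:
        results = results + run_query(covered_until, now)
        covered_until = now
    store[query_id] = (covered_until, results)
    return results


events = [(1, "a"), (5, "b"), (9, "c")]
windows_searched = []


def run_query(start, end):
    windows_searched.append((start, end))
    return [name for time, name in events if start <= time < end]


store = {}
at_time_b = run_open_ended("q1", now=6, store=store, run_query=run_query)
at_time_c = run_open_ended("q1", now=10, store=store, run_query=run_query)
```

The second run searches only the window from time B to time C and reuses the stored results for the window from time A to time B.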

3.3.3. Search Nodes

As described herein, the search nodes 506 can be the primary query execution engines for the query system 214, and can be implemented as distinct computing devices, virtual machines, containers, containers of a pod, or processes or threads associated with one or more containers. Accordingly, each search node 506 can include a processing device and a data store, as depicted at a high level in FIG. 5. Depending on the embodiment, the processing device and data store can be dedicated to the search node (e.g., embodiments where each search node is a distinct computing device) or can be shared with other search nodes or components of the data intake and query system 108 (e.g., embodiments where the search nodes are implemented as containers or virtual machines or where the shared data store is a networked data store, etc.).

In some embodiments, the search nodes 506 can obtain and search buckets identified by the search manager 514 that include data that satisfies at least a portion of the query, identify the set of data within the buckets that satisfies the query, perform one or more transformations on the set of data, and communicate the set of data to the search manager 514. Individually, a search node 506 can obtain the buckets assigned to it by the search manager 514 for a particular query, search the assigned buckets for a subset of the set of data, perform one or more transformations on the subset of data, and communicate partial search results to the search manager 514 for additional processing and combination with the partial results from other search nodes 506.

In some cases, the buckets to be searched may be located in a local data store of the search node 506 or a data store that is shared between multiple search nodes 506. In such cases, the search nodes 506 can identify the location of the buckets and search the buckets for the set of data that satisfies the query.

In certain cases, the buckets may be located in the common storage 216. In such cases, the search nodes 506 can search the buckets in the common storage 216 and/or copy the buckets from the common storage 216 to a local or shared data store and search the locally stored copy for the set of data. As described herein, the cache manager 516 can coordinate with the search nodes 506 to identify the location of the buckets (whether in a local or shared data store or in common storage 216) and/or obtain buckets stored in common storage 216.

Once the relevant buckets (or relevant files of the buckets) are obtained, the search nodes 506 can search their contents to identify the set of data to be processed. In some cases, upon obtaining a bucket from the common storage 216, a search node 506 can decompress the bucket from a compressed format and access one or more files stored within the bucket. In some cases, the search node 506 references a bucket summary or manifest to locate one or more portions (e.g., records or individual files) of the bucket that potentially contain information relevant to the search.

In some cases, the search nodes 506 can use all of the files of a bucket to identify the set of data. In certain embodiments, the search nodes 506 use a subset of the files of a bucket to identify the set of data. For example, in some cases, a search node 506 can use an inverted index, bloom filter, or bucket summary or manifest to identify a subset of the set of data without searching the raw machine data of the bucket. In certain cases, the search node 506 uses the inverted index, bloom filter, bucket summary, and raw machine data to identify the subset of the set of data that satisfies the query.

In some embodiments, depending on the query, the search nodes 506 can perform one or more transformations on the data from the buckets. For example, the search nodes 506 may perform various data transformations, scripts, and processes, such as a count of the set of data.

As the search nodes 506 execute the query, they can provide the search manager 514 with search results. In some cases, a search node 506 provides the search manager 514 results as they are identified by the search node 506, and updates the results over time. In certain embodiments, a search node 506 waits until all of its partial results are gathered before sending the results to the search manager 514.

In some embodiments, the search nodes 506 provide a status of the query to the search manager 514. For example, an individual search node 506 can inform the search manager 514 of which buckets it has searched and/or provide the search manager 514 with the results from the searched buckets. As mentioned, the search manager 514 can track or store the status and the results as they are received from the search node 506. In the event the search node 506 becomes unresponsive or unavailable, the tracked information can be used to generate and assign a new search node 506 to execute the remaining portions of the query assigned to the unavailable search node 506.
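
As a non-limiting illustration, the tracked status can be used to hand the unfinished portion of a query to a replacement search node, as sketched below. The class, its method names, and the use of string identifiers for nodes and buckets are assumptions made for this sketch.

```python
class QueryStatusTracker:
    """Hypothetical record kept by a search manager for a single query."""

    def __init__(self, assignments):
        # node id -> buckets assigned to that node
        self.assignments = {node: list(b) for node, b in assignments.items()}
        self.searched = {node: set() for node in assignments}
        self.partial_results = []

    def report(self, node, bucket, results):
        """Record that `node` finished `bucket` and keep its partial results."""
        self.searched[node].add(bucket)
        self.partial_results.extend(results)

    def reassign(self, failed_node, new_node):
        """Move the failed node's unsearched buckets to a replacement node."""
        done = self.searched.pop(failed_node)
        remaining = [b for b in self.assignments.pop(failed_node) if b not in done]
        self.assignments[new_node] = remaining
        self.searched[new_node] = set()
        return remaining
```

Because results from already-searched buckets are retained, only the remaining buckets are searched again after a failure.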

3.3.4. Cache Manager

As mentioned, the cache manager 516 can communicate with the search nodes 506 to obtain or identify the location of the buckets assigned to the search nodes 506, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container.

In some embodiments, based on the receipt of a bucket assignment, a search node 506 can provide the cache manager 516 with an identifier of the bucket that it is to search, a file associated with the bucket that it is to search, and/or a location of the bucket. In response, the cache manager 516 can determine whether the identified bucket or file is located in a local or shared data store or is to be retrieved from the common storage 216.

As mentioned, in some cases, multiple search nodes 506 can share a data store. Accordingly, if the cache manager 516 determines that the requested bucket is located in a local or shared data store, the cache manager 516 can provide the search node 506 with the location of the requested bucket or file. In certain cases, if the cache manager 516 determines that the requested bucket or file is not located in the local or shared data store, the cache manager 516 can request the bucket or file from the common storage 216, and inform the search node 506 that the requested bucket or file is being retrieved from common storage 216.

In some cases, the cache manager 516 can request one or more files associated with the requested bucket prior to, or in place of, requesting all contents of the bucket from the common storage 216. For example, a search node 506 may request a subset of files from a particular bucket. Based on the request and a determination that the files are located in common storage 216, the cache manager 516 can download or obtain the identified files from the common storage 216.

In some cases, based on the information provided from the search node 506, the cache manager 516 may be unable to uniquely identify a requested file or files within the common storage 216. Accordingly, in certain embodiments, the cache manager 516 can retrieve a bucket summary or manifest file from the common storage 216 and provide the bucket summary to the search node 506. In some cases, the cache manager 516 can provide the bucket summary to the search node 506 while concurrently informing the search node 506 that the requested files are not located in a local or shared data store and are to be retrieved from common storage 216.

Using the bucket summary, the search node 506 can uniquely identify the files to be used to execute the query. Using the unique identification, the cache manager 516 can request the files from the common storage 216. Accordingly, rather than downloading the entire contents of the bucket from common storage 216, the cache manager 516 can download those portions of the bucket that are to be used by the search node 506 to execute the query. In this way, the cache manager 516 can decrease the amount of data sent over the network and decrease the search time.

As a non-limiting example, a search node 506 may determine that an inverted index of a bucket is to be used to execute a query. For example, the search node 506 may determine that all the information that it needs to execute the query on the bucket can be found in an inverted index associated with the bucket. Accordingly, the search node 506 can request the file associated with the inverted index of the bucket from the cache manager 516. Based on a determination that the requested file is not located in a local or shared data store, the cache manager 516 can determine that the file is located in the common storage 216.

As the bucket may have multiple inverted indexes associated with it, the information provided by the search node 506 may be insufficient to uniquely identify the inverted index within the bucket. To address this issue, the cache manager 516 can request a bucket summary or manifest from the common storage 216, and forward it to the search node 506. The search node 506 can analyze the bucket summary to identify the particular inverted index that is to be used to execute the query, and request the identified particular inverted index from the cache manager 516 (e.g., by name and/or location). Using the bucket manifest and/or the information received from the search node 506, the cache manager 516 can obtain the identified particular inverted index from the common storage 216. By obtaining the bucket manifest and downloading the requested inverted index instead of all inverted indexes or files of the bucket, the cache manager 516 can reduce the amount of data communicated over the network and reduce the search time for the query.

In some cases, when requesting a particular file, the search node 506 can include a priority level for the file. For example, the files of a bucket may be of different sizes and may be used more or less frequently when executing queries. For example, the bucket manifest may be a relatively small file. However, if the bucket is searched, the bucket manifest can be a relatively valuable file (and frequently used) because it includes a list or index of the various files of the bucket. Similarly, a bloom filter of a bucket may be a relatively small file but frequently used as it can relatively quickly identify the contents of the bucket. In addition, an inverted index may be used more frequently than raw data of a bucket to satisfy a query.

Accordingly, to improve retention of files that are commonly used in a search of a bucket, the search node 506 can include a priority level for the requested file. The cache manager 516 can use the priority level received from the search node 506 to determine how long to keep or when to evict the file from the local or shared data store. For example, files identified by the search node 506 as having a higher priority level can be stored for a greater period of time than files identified as having a lower priority level.

Furthermore, the cache manager 516 can determine what data to retain, and for how long, in the local or shared data stores of the search nodes 506 based on a bucket caching policy. In some cases, the bucket caching policy can rely on any one or any combination of the priority level received from the search nodes 506 for a particular file, least recently used, most recent in time, or other policies to indicate how long to retain files in the local or shared data store.
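
As a non-limiting illustration, one possible bucket caching policy combines the priority level with least-recently-used ordering, as sketched below. The class name, the fixed entry capacity (assumed to be at least one), and the integer priority levels are hypothetical choices for this sketch; other policies can be used.

```python
class BucketFileCache:
    """Hypothetical cache that evicts by lowest priority, then least recent use."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of cached files (assumed >= 1)
        self._entries = {}        # file name -> (priority, last-used tick)
        self._clock = 0

    def access(self, name, priority=0):
        """Record a use of `name`, evicting one file first if the cache is full."""
        self._clock += 1
        if name not in self._entries and len(self._entries) >= self.capacity:
            # Tuples compare by priority first, then by last-used tick.
            victim = min(self._entries, key=self._entries.get)
            del self._entries[victim]
        self._entries[name] = (priority, self._clock)

    def cached(self):
        return set(self._entries)
```

In this sketch, a small but frequently used file such as a bucket manifest can be given a higher priority level so that it outlasts raw data files of lower priority.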

In some instances, according to the bucket caching policy, the cache manager 516 or other component of the query system 214 (e.g., the search master 512 or search manager 514) can instruct search nodes 506 to retrieve and locally cache copies of various buckets from the common storage 216, independently of processing queries. In certain embodiments, the query system 214 is configured, according to the bucket caching policy, such that one or more buckets from the common storage 216 (e.g., buckets associated with a tenant or partition of a tenant) or each bucket from the common storage 216 is locally cached on at least one search node 506.

In some embodiments, according to the bucket caching policy, the query system 214 is configured such that at least one bucket from the common storage 216 is locally cached on at least two search nodes 506. Caching a bucket on at least two search nodes 506 may be beneficial, for example, in instances where different queries both require searching the bucket (e.g., because the at least two search nodes 506 may process their respective local copies in parallel). In still other embodiments, the query system 214 is configured, according to the bucket caching policy, such that one or more buckets from the common storage 216 or all buckets from the common storage 216 are locally cached on at least a given number n of search nodes 506, wherein n is defined by a replication factor on the system 108. For example, a replication factor of five may be established to ensure that five copies of a bucket are locally cached across different search nodes 506.
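
As a non-limiting illustration, a replication factor can be honored by placing each bucket on n distinct search nodes, as sketched below. The function name and the rotating placement scheme are assumptions made for this sketch; any scheme that yields n distinct nodes per bucket would serve.

```python
def place_replicas(bucket_ids, node_ids, replication_factor):
    """Map each bucket to `replication_factor` distinct search nodes."""
    if replication_factor > len(node_ids):
        raise ValueError("replication factor exceeds the number of search nodes")
    placement = {}
    for i, bucket in enumerate(bucket_ids):
        # Rotate the starting node per bucket so that copies are spread out.
        placement[bucket] = [
            node_ids[(i + k) % len(node_ids)] for k in range(replication_factor)
        ]
    return placement
```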

In certain embodiments, the search manager 514 (or search master 512) can assign buckets to different search nodes 506 based on time. For example, buckets that are less than one day old can be assigned to a first group of search nodes 506 for caching, buckets that are more than one day but less than one week old can be assigned to a second group of search nodes 506 for caching, and buckets that are more than one week old can be assigned to a third group of search nodes 506 for caching. In certain cases, the first group can be larger than the second group, and the second group can be larger than the third group. In this way, the query system 214 can provide better/faster results for queries searching data that is less than one day old, and so on. It will be understood that the search nodes can be grouped and assigned buckets in a variety of ways. For example, search nodes 506 can be grouped based on a tenant identifier, index, etc. In this way, the query system 214 can dynamically provide faster results based on any one or any number of factors.
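
As a non-limiting illustration, the time-based grouping in the example above can be sketched as follows. The function name and group labels are hypothetical, and the treatment of a bucket that is exactly one day or one week old is an assumption, as the example does not specify those boundaries.

```python
from datetime import timedelta


def cache_group_for(bucket_age):
    """Pick the search node group that caches a bucket of the given age."""
    if bucket_age < timedelta(days=1):
        return "first"   # less than one day old
    if bucket_age < timedelta(weeks=1):
        return "second"  # between one day and one week old
    return "third"       # one week old or older
```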

In some embodiments, when a search node 506 is added to the query system 214, the cache manager 516 can, based on the bucket caching policy, instruct the search node 506 to download one or more buckets from common storage 216 prior to receiving a query. In certain embodiments, the cache manager 516 can instruct the search node 506 to download specific buckets, such as most recent in time buckets, buckets associated with a particular tenant or partition, etc. In some cases, the cache manager 516 can instruct the search node 506 to download the buckets before the search node 506 reports to the search node monitor 508 that it is available for executing queries. It will be understood that other components of the query system 214 can implement this functionality, such as, but not limited to, the query system manager 502, search node monitor 508, search manager 514, or the search nodes 506 themselves.

In certain embodiments, when a search node 506 is removed from the query system 214 or becomes unresponsive or unavailable, the cache manager 516 can identify the buckets that the removed search node 506 was responsible for and instruct the remaining search nodes 506 that they will be responsible for the identified buckets. In some cases, the remaining search nodes 506 can download the identified buckets from common storage 216 or retrieve them from the data store associated with the removed search node 506.

In some cases, the cache manager 516 can change the bucket-search node 506 assignments, such as when a search node 506 is removed or added. In certain embodiments, based on a reassignment, the cache manager 516 can instruct a particular search node 506 to remove buckets to which it is no longer assigned, reduce the priority level of the buckets, etc. In this way, the cache manager 516 can cause the reassigned bucket to be removed from the search node 506 more quickly than it would be without the reassignment. In certain embodiments, the search node 506 that receives the new assignment for the bucket can retrieve the bucket from the now unassigned search node 506 and/or retrieve the bucket from common storage 216.

3.3.5. Search Node Monitor and Catalog

The search node monitor 508 can monitor search nodes and populate the search node catalog 510 with relevant information, and can be implemented as a distinct computing device, virtual machine, container, container of a pod, or a process or thread associated with a container.

In some cases, the search node monitor 508 can ping the search nodes 506 over time to determine their availability, responsiveness, and/or utilization rate. In certain embodiments, each search node 506 can include a monitoring module that provides performance metrics or status updates about the search node 506 to the search node monitor 508. For example, the monitoring module can indicate the amount of processing resources in use by the search node 506, the utilization rate of the search node 506, the amount of memory used by the search node 506, etc. In certain embodiments, the search node monitor 508 can determine that a search node 506 is unavailable or failing based on the data in the status update or the absence of a status update from the monitoring module of the search node 506.

Using the information obtained from the search nodes 506, the search node monitor 508 can populate the search node catalog 510 and update it over time. As described herein, the search manager 514 can use the search node catalog 510 to identify search nodes 506 available to execute a query. In some embodiments, the search manager 514 can communicate with the search node catalog 510 using an API.

As the availability, responsiveness, and/or utilization change for the different search nodes 506, the search node monitor 508 can update the search node catalog 510. In this way, the search node catalog 510 can retain an up-to-date list of search nodes 506 available to execute a query.

Furthermore, as search nodes 506 are instantiated (or at other times), the search node monitor 508 can update the search node catalog 510 with information about the search node 506, such as, but not limited to, its computing resources, utilization, network architecture (identification of machine where it is instantiated, location with reference to other search nodes 506, computing resources shared with other search nodes 506, such as data stores, processors, I/O, etc.), etc.

3.4. Common Storage

Returning to FIG. 2, the common storage 216 can be used to store data indexed by the indexing system 212, and can be implemented using one or more data stores 218.

In some systems, the same computing devices (e.g., indexers) operate to ingest, index, store, and search data. The use of an indexer to both ingest and search information may be beneficial, for example, because an indexer may have ready access to information that it has ingested, and can quickly access that information for searching purposes. However, use of an indexer to both ingest and search information may not be desirable in all instances. As an illustrative example, consider an instance in which ingested data is organized into buckets, and each indexer is responsible for maintaining buckets within a data store corresponding to the indexer. Illustratively, a set of ten indexers may maintain 100 buckets, distributed evenly across ten data stores (each of which is managed by a corresponding indexer). Information may be distributed throughout the buckets according to a load-balancing mechanism used to distribute information to the indexers during data ingestion. In an idealized scenario, information responsive to a query would be spread across the 100 buckets, such that each indexer may search its corresponding ten buckets in parallel, and provide search results to a search head. However, it is expected that this idealized scenario may not always occur, and that there will be at least some instances in which information responsive to a query is unevenly distributed across data stores. As one example, consider a query in which responsive information exists within ten buckets, all of which are included in a single data store associated with a single indexer. In such an instance, a bottleneck may be created at the single indexer, and the effects of parallelized searching across the indexers may be minimized. To increase the speed of operation of search queries in such cases, it may therefore be desirable to store data indexed by the indexing system 212 in common storage 216 that can be accessible to any one or multiple components of the indexing system 212 or the query system 214.

Common storage 216 may correspond to any data storage system accessible to the indexing system 212 and the query system 214. For example, common storage 216 may correspond to a storage area network (SAN), network attached storage (NAS), other network-accessible storage system (e.g., a hosted storage system, such as Amazon S3 or EBS provided by Amazon, Inc., Google Cloud Storage, Microsoft Azure Storage, etc., which may also be referred to as “cloud” storage), or combination thereof. The common storage 216 may include, for example, hard disk drives (HDDs), solid state storage devices (SSDs), or other substantially persistent or non-transitory media. Data stores 218 within common storage 216 may correspond to physical data storage devices (e.g., an individual HDD) or a logical storage device, such as a grouping of physical data storage devices or a containerized or virtualized storage device hosted by an underlying physical storage device. In some embodiments, the common storage 216 may also be referred to as a shared storage system or shared storage environment as the data stores 218 may store data associated with multiple customers, tenants, etc., or across different data intake and query systems 108 or other systems unrelated to the data intake and query systems 108.

The common storage 216 can be configured to provide high availability, highly resilient, low loss data storage. In some cases, to provide the high availability, highly resilient, low loss data storage, the common storage 216 can store multiple copies of the data in the same and different geographic locations and across different types of data stores (e.g., solid state, hard drive, tape, etc.). Further, as data is received at the common storage 216 it can be automatically replicated multiple times according to a replication factor to different data stores across the same and/or different geographic locations.

In one embodiment, common storage 216 may be multi-tiered, with each tier providing a different speed of access to information stored in that tier. For example, a first tier of the common storage 216 may be physically co-located with the indexing system 212 or the query system 214 and provide rapid access to information of the first tier, while a second tier may be located in a different physical location (e.g., in a hosted or “cloud” computing environment) and provide less rapid access to information of the second tier.

Distribution of data between tiers may be controlled by any number of algorithms or mechanisms. In one embodiment, a first tier may include data generated or including timestamps within a threshold period of time (e.g., the past seven days), while a second tier or subsequent tiers include data older than that time period. In another embodiment, a first tier may include a threshold amount (e.g., n terabytes) of recently accessed data, while a second tier stores the remaining less recently accessed data.

In one embodiment, data within the data stores 218 is grouped into buckets, each of which is commonly accessible to the indexing system 212 and query system 214. The size of each bucket may be selected according to the computational resources of the common storage 216 or the data intake and query system 108 overall. For example, the size of each bucket may be selected to enable an individual bucket to be relatively quickly transmitted via a network, without introducing excessive additional data storage requirements due to metadata or other overhead associated with an individual bucket. In one embodiment, each bucket is 750 megabytes in size. Further, as mentioned, in some embodiments, some buckets can be merged to create larger buckets.

As described herein, each bucket can include one or more files, such as, but not limited to, one or more compressed or uncompressed raw machine data files, metadata files, filter files, index files, bucket summary or manifest files, etc. In addition, each bucket can store events including raw machine data associated with a timestamp.

As described herein, the indexing nodes 404 can generate buckets during indexing and communicate with common storage 216 to store the buckets. For example, data may be provided to the indexing nodes 404 from one or more ingestion buffers of the intake system 210. The indexing nodes 404 can process the information and store it as buckets in common storage 216, rather than in a data store maintained by an individual indexer or indexing node. Thus, the common storage 216 can render information of the data intake and query system 108 commonly accessible to elements of the system 108. As described herein, the common storage 216 can enable parallelized searching of buckets to occur independently of the operation of indexing system 212.

As noted above, it may be beneficial in some instances to separate data indexing and searching. Accordingly, as described herein, the search nodes 506 of the query system 214 can search for data stored within common storage 216. The search nodes 506 may therefore be communicatively attached (e.g., via a communication network) with the common storage 216, and be enabled to access buckets within the common storage 216.

Further, as described herein, because the search nodes 506 in some instances are not statically assigned to individual data stores 218 (and thus to buckets within such a data store 218), the buckets searched by an individual search node 506 may be selected dynamically, to increase the parallelization with which the buckets can be searched. For example, consider an instance where information is stored within 100 buckets, and a query is received at the data intake and query system 108 for information within ten buckets. Unlike a scenario in which buckets are statically assigned to an indexer, which could result in a bottleneck if the ten relevant buckets are associated with the same indexer, the ten buckets holding relevant information may be dynamically distributed across multiple search nodes 506. Thus, if ten search nodes 506 are available to process a query, each search node 506 may be assigned to retrieve and search within one bucket, greatly increasing parallelization when compared to the low-parallelization scenarios (e.g., where a single indexer 206 is required to search all ten buckets).
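
As a non-limiting illustration, the dynamic distribution in the example above can be sketched with a simple round-robin assignment. The function name and the round-robin scheme are assumptions made for this sketch; the description does not prescribe a particular distribution scheme.

```python
def distribute(buckets, nodes):
    """Spread the relevant buckets across the available search nodes."""
    assignment = {node: [] for node in nodes}
    for i, bucket in enumerate(buckets):
        assignment[nodes[i % len(nodes)]].append(bucket)
    return assignment
```

With ten relevant buckets and ten available search nodes, each search node receives exactly one bucket, in contrast to the static case in which a single indexer searches all ten.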

Moreover, because searching occurs at the search nodes 506 rather than at the indexing system 212, indexing resources can be allocated independently of searching operations. For example, search nodes 506 may be executed by a different processor or computing device than indexing nodes 404, enabling computing resources available to search nodes 506 to scale independently of resources available to indexing nodes 404. Additionally, the impact on data ingestion and indexing due to above-average volumes of search query requests is reduced or eliminated, and similarly, the impact of data ingestion on search query result generation time also is reduced or eliminated.

As will be appreciated in view of the above description, the use of a common storage 216 can provide many advantages within the data intake and query system 108. Specifically, use of a common storage 216 can enable the system 108 to decouple functionality of data indexing by indexing nodes 404 from functionality of searching by search nodes 506. Moreover, because buckets containing data are accessible by each search node 506, a search manager 514 can dynamically allocate search nodes 506 to buckets at the time of a search in order to increase parallelization. Thus, use of a common storage 216 can substantially improve the speed and efficiency of operation of the system 108.

3.5. Data Store Catalog

The data store catalog 220 can store information about the data stored in common storage 216, and can be implemented using one or more data stores. In some embodiments, the data store catalog 220 can be implemented as a portion of the common storage 216 and/or using similar data storage techniques (e.g., local or cloud storage, multi-tiered storage, etc.). In another implementation, the data store catalog 220 may utilize a database, e.g., a relational database engine, such as commercially-provided relational database services, e.g., Amazon's Aurora. In some implementations, the data store catalog 220 may use an API to allow access to register buckets, and to allow query system 214 to access buckets. In other implementations, data store catalog 220 may be implemented through other means, and may be stored as part of common storage 216, or another type of common storage, as previously described. In various implementations, requests for buckets may include a tenant identifier and some form of user authentication, e.g., a user access token that can be authenticated by an authentication service. In various implementations, the data store catalog 220 may store one data structure, e.g., table, per tenant, for the buckets associated with that tenant, one data structure per partition of each tenant, etc. In other implementations, a single data structure, e.g., a single table, may be used for all tenants, and unique tenant IDs may be used to identify buckets associated with the different tenants.

As described herein, the data store catalog 220 can be updated by the indexing system 212 with information about the buckets or data stored in common storage 216. For example, the data store catalog 220 can store an identifier for sets of data in common storage 216, a location of the sets of data in common storage 216, tenants or indexes associated with the sets of data, timing information about the sets of data, etc. In embodiments where the data in common storage 216 is stored as buckets, the data store catalog 220 can include a bucket identifier for the buckets in common storage 216, a location of or path to the buckets in common storage 216, a time range of the data in the bucket (e.g., range of time between the first-in-time event of the bucket and the last-in-time event of the bucket), a tenant identifier identifying a customer or computing device associated with the bucket, and/or an index or partition associated with the bucket, etc.

In certain embodiments, the data store catalog 220 can include an indication of a location of a copy of a bucket found in one or more search nodes 506. For example, as buckets are copied to search nodes 506, the query system 214 can update the data store catalog 220 with information about which search nodes 506 include a copy of the buckets. This information can be used by the query system 214 to assign search nodes 506 to buckets as part of a query.

In certain embodiments, the data store catalog 220 can function as an index or inverted index of the buckets stored in common storage 216. For example, the data store catalog 220 can provide location and other information about the buckets stored in common storage 216. In some embodiments, the data store catalog 220 can provide additional information about the contents of the buckets. For example, the data store catalog 220 can provide a list of sources, sourcetypes, or hosts associated with the data in the buckets.

In certain embodiments, the data store catalog 220 can include one or more keywords found within the data of the buckets. In such embodiments, the data store catalog 220 can be similar to an inverted index, except rather than identifying specific events associated with a particular host, source, sourcetype, or keyword, it can identify buckets with data associated with the particular host, source, sourcetype, or keyword.

In some embodiments, the query system 214 (e.g., search head 504, search master 512, search manager 514, etc.) can communicate with the data store catalog 220 as part of processing and executing a query. In certain cases, the query system 214 communicates with the data store catalog 220 using an API. As a non-limiting example, the query system 214 can provide the data store catalog 220 with at least a portion of the query or one or more filter criteria associated with the query. In response, the data store catalog 220 can provide the query system 214 with an identification of buckets that store data that satisfies at least a portion of the query. In addition, the data store catalog 220 can provide the query system 214 with an indication of the location of the identified buckets in common storage 216 and/or in one or more local or shared data stores of the search nodes 506.

Accordingly, using the information from the data store catalog 220, the query system 214 can reduce (or filter) the amount of data or number of buckets to be searched. For example, using tenant or partition information in the data store catalog 220, the query system 214 can exclude buckets associated with a tenant or a partition, respectively, that is not to be searched. Similarly, using time range information, the query system 214 can exclude buckets that do not satisfy a time range from a search. In this way, the data store catalog 220 can reduce the amount of data to be searched and decrease search times.
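As a non-limiting illustration, the filtering described above might be sketched as follows. The record layout (`BucketEntry`) and the function name `filter_buckets` are hypothetical and are not taken from the disclosure; the sketch only assumes that each catalog entry carries a tenant, a partition, and a time span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketEntry:
    # Hypothetical catalog record; the field names are assumptions.
    bucket_id: str
    tenant: str
    partition: str
    earliest: int  # earliest event time in the bucket (epoch seconds)
    latest: int    # latest event time in the bucket (epoch seconds)


def filter_buckets(catalog, tenant, partition, start, end):
    """Keep entries whose tenant and partition match and whose
    time span overlaps the query range [start, end]."""
    return [
        b for b in catalog
        if b.tenant == tenant
        and b.partition == partition
        and b.latest >= start
        and b.earliest <= end
    ]


catalog = [
    BucketEntry("b1", "acme", "main", 0, 100),
    BucketEntry("b2", "acme", "main", 200, 300),
    BucketEntry("b3", "other", "main", 0, 100),     # wrong tenant
    BucketEntry("b4", "acme", "metrics", 50, 150),  # wrong partition
    BucketEntry("b5", "acme", "main", 400, 500),    # outside time range
]
selected = filter_buckets(catalog, "acme", "main", 50, 250)
```

In this sketch only buckets `b1` and `b2` remain to be searched; the other three are excluded without being read.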

As mentioned, in some cases, as buckets are copied from common storage 216 to search nodes 506 as part of a query, the query system 214 can update the data store catalog 220 with the location information of the copy of the bucket. The query system 214 can use this information to assign search nodes 506 to buckets. For example, if the data store catalog 220 indicates that a copy of a bucket in common storage 216 is stored in a particular search node 506, the query system 214 can assign the particular search node to the bucket. In this way, the query system 214 can reduce the likelihood that the bucket will be retrieved from common storage 216. In certain embodiments, the data store catalog 220 can store an indication that a bucket was recently downloaded to a search node 506. The query system 214 can use this information to assign the search node 506 to that bucket.
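One possible way to express this assignment preference is sketched below. The function name, the `cached_on` mapping, and the round-robin fallback for buckets with no known copy are assumptions made for illustration; the disclosure does not specify a fallback policy.

```python
def assign_search_nodes(bucket_ids, cached_on, nodes):
    """Map each bucket to a search node, preferring a node that the
    catalog reports as already holding a copy of the bucket."""
    assignments = {}
    fallback = 0
    for bucket_id in bucket_ids:
        holders = [n for n in cached_on.get(bucket_id, []) if n in nodes]
        if holders:
            # A copy already exists on this node, so no download is needed.
            assignments[bucket_id] = holders[0]
        else:
            # Assumed fallback: spread uncached buckets round-robin.
            assignments[bucket_id] = nodes[fallback % len(nodes)]
            fallback += 1
    return assignments


nodes = ["n1", "n2"]
cached_on = {"b2": ["n2"]}  # hypothetical catalog view: b2 is cached on n2
plan = assign_search_nodes(["b1", "b2", "b3"], cached_on, nodes)
```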

3.6. Query Acceleration Data Store

With continued reference to FIG. 2, the query acceleration data store 222 can be used to store query results or datasets for accelerated access, and can be implemented as a distributed in-memory database system, storage subsystem, local or networked storage (e.g., cloud storage), and so on, which can maintain (e.g., store) datasets in both low-latency memory (e.g., random access memory, such as volatile or non-volatile memory) and longer-latency memory (e.g., solid state storage, disk drives, and so on). In some embodiments, to increase efficiency and response times, the query acceleration data store 222 can maintain particular datasets in the low-latency memory, and other datasets in the longer-latency memory. For example, in some embodiments, the datasets can be stored in-memory (non-limiting examples: RAM or volatile memory) with disk spillover (non-limiting examples: hard disks, disk drive, non-volatile memory, etc.). In this way, the query acceleration data store 222 can be used to serve interactive or iterative searches. In some cases, datasets which are determined to be frequently accessed by a user can be stored in the lower-latency memory. Similarly, datasets of less than a threshold size can be stored in the lower-latency memory.

In certain embodiments, the search manager 514 or search nodes 506 can store query results in the query acceleration data store 222. In some embodiments, the query results can correspond to partial results from one or more search nodes 506 or to aggregated results from all the search nodes 506 involved in a query or the search manager 514. In such embodiments, the results stored in the query acceleration data store 222 can be served at a later time to the search head 504, combined with additional results obtained from a later query, transformed or further processed by the search nodes 506 or search manager 514, etc. For example, in some cases, such as where a query does not include a termination date, the search manager 514 can store initial results in the acceleration data store 222 and update the initial results as additional results are received. At any time, the initial results, or iteratively updated results, can be provided to a client device 204, transformed by the search nodes 506 or search manager 514, etc.

As described herein, a user can indicate in a query that particular datasets or results are to be stored in the query acceleration data store 222. The query can then indicate operations to be performed on the particular datasets. For subsequent queries directed to the particular datasets (e.g., queries that indicate other operations for the datasets stored in the acceleration data store 222), the search nodes 506 can obtain information directly from the query acceleration data store 222.

Additionally, since the query acceleration data store 222 can be utilized to service requests from different client devices 204, the query acceleration data store 222 can implement access controls (e.g., an access control list) with respect to the stored datasets. In this way, the stored datasets can optionally be accessible only to users associated with requests for the datasets. Optionally, a user who provides a query can indicate that one or more other users are authorized to access particular requested datasets. In this way, the other users can utilize the stored datasets, thus reducing latency associated with their queries.

In some cases, data from the intake system 210 (e.g., ingested data buffer 310, etc.) can be stored in the acceleration data store 222. In such embodiments, the data from the intake system 210 can be transformed by the search nodes 506 or combined with data in the common storage 216.

Furthermore, in some cases, if the query system 214 receives a query that includes a request to process data in the query acceleration data store 222, as well as data in the common storage 216, the search manager 514 or search nodes 506 can begin processing the data in the query acceleration data store 222, while also obtaining and processing the other data from the common storage 216. In this way, the query system 214 can rapidly provide initial results for the query, while the search nodes 506 obtain and search the data from the common storage 216.

It will be understood that the data intake and query system 108 can include fewer or more components as desired. For example, in some embodiments, the system 108 does not include an acceleration data store 222. Further, it will be understood that in some embodiments, the functionality described herein for one component can be performed by another component. For example, the search master 512 and search manager 514 can be combined as one component, etc.

4.0. Data Intake and Query System Functions

As described herein, the various components of the data intake and query system 108 can perform a variety of functions associated with the intake, indexing, storage, and querying of data from a variety of sources. It will be understood that any one or any combination of the functions described herein can be combined as part of a single routine or method. For example, a routine can include any one or any combination of one or more data ingestion functions, one or more indexing functions, and/or one or more searching functions.

4.1 Ingestion

As discussed above, ingestion into the data intake and query system 108 can be facilitated by an intake system 210, which functions to process data according to a streaming data model, and make the data available as messages on an output ingestion buffer 310, categorized according to a number of potential topics. Messages may be published to the output ingestion buffer 310 by the streaming data processors 308, based on preliminary processing of messages published to an intake ingestion buffer 306. The intake ingestion buffer 306 is, in turn, populated with messages by one or more publishers, each of which may represent an intake point for the data intake and query system 108. The publishers may collectively implement a data retrieval subsystem 304 for the data intake and query system 108, which subsystem 304 functions to retrieve data from a data source 202 and publish the data in the form of a message on the intake ingestion buffer 306. A flow diagram depicting an illustrative embodiment for processing data at the intake system 210 is shown at FIG. 6. While the flow diagram is illustratively described with respect to a single message, the same or similar interactions may be used to process multiple messages at the intake system 210.

4.1.1 Publication to Intake Topic(s)

As shown in FIG. 6, processing of data at the intake system 210 can illustratively begin at (1), where a data retrieval subsystem 304 or a data source 202 publishes a message to a topic at the intake ingestion buffer 306. Generally described, the data retrieval subsystem 304 may include either or both push-based and pull-based publishers. Push-based publishers can illustratively correspond to publishers which independently initiate transmission of messages to the intake ingestion buffer 306. Pull-based publishers can illustratively correspond to publishers which await an inquiry by the intake ingestion buffer 306 for messages to be published to the buffer 306. The publication of a message at (1) is intended to include publication under either push- or pull-based models.

As discussed above, the data retrieval subsystem 304 may generate the message based on data received from a forwarder 302 and/or from one or more data sources 202. In some instances, generation of a message may include converting a format of the data into a format suitable for publishing on the intake ingestion buffer 306. Generation of a message may further include determining a topic for the message. In one embodiment, the data retrieval subsystem 304 selects a topic based on a data source 202 from which the data is received, or based on the specific publisher (e.g., intake point) on which the message is generated. For example, each data source 202 or specific publisher may be associated with a particular topic on the intake ingestion buffer 306 to which corresponding messages are published. In some instances, the same source data may be used to generate multiple messages to the intake ingestion buffer 306 (e.g., associated with different topics).

4.1.2 Transmission to Streaming Data Processors

After receiving a message from a publisher, the intake ingestion buffer 306, at (2), determines subscribers to the topic. For the purposes of example, it will be assumed that at least one device of the streaming data processors 308 has subscribed to the topic (e.g., by previously transmitting to the intake ingestion buffer 306 a subscription request). As noted above, the streaming data processors 308 may be implemented by a number of (logically or physically) distinct devices. As such, the intake ingestion buffer 306, at (2), may operate to determine which devices of the streaming data processors 308 have subscribed to the topic (or topics) to which the message was published.

Thereafter, at (3), the intake ingestion buffer 306 publishes the message to the streaming data processors 308 in accordance with the pub-sub model. This publication may correspond to a “push” model of communication, whereby an ingestion buffer determines topic subscribers and initiates transmission of messages within the topic to the subscribers. While interactions of FIG. 6 are described with reference to such a push model, in some embodiments a pull model of transmission may additionally or alternatively be used. Illustratively, rather than an ingestion buffer determining topic subscribers and initiating transmission of messages for the topic to a subscriber (e.g., the streaming data processors 308), an ingestion buffer may enable a subscriber to query for unread messages for a topic, and for the subscriber to initiate transmission of the messages from the ingestion buffer to the subscriber. Thus, an ingestion buffer (e.g., the intake ingestion buffer 306) may enable subscribers to “pull” messages from the buffer. As such, interactions of FIG. 6 (e.g., including interactions (2) and (3) as well as (9), (10), (16), and (17) described below) may be modified to include pull-based interactions (e.g., whereby a subscriber queries for unread messages and retrieves the messages from an appropriate ingestion buffer).
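The difference between the push and pull models described above can be illustrated with a toy in-memory buffer. The class and method names below are hypothetical and do not correspond to the API of any particular data streaming system; per-subscriber read offsets are an assumed way of tracking "unread" messages.

```python
from collections import defaultdict


class IngestionBuffer:
    """Toy in-memory buffer supporting both delivery models."""

    def __init__(self):
        self._messages = defaultdict(list)   # topic -> published messages
        self._callbacks = defaultdict(list)  # topic -> push subscribers
        self._offsets = defaultdict(int)     # (topic, subscriber) -> next unread index

    def subscribe(self, topic, callback):
        self._callbacks[topic].append(callback)

    def publish(self, topic, message):
        self._messages[topic].append(message)
        # Push model: the buffer determines the topic's subscribers
        # and initiates transmission to each of them.
        for callback in self._callbacks[topic]:
            callback(message)

    def pull(self, topic, subscriber_id):
        # Pull model: the subscriber queries for messages it has not read.
        key = (topic, subscriber_id)
        unread = self._messages[topic][self._offsets[key]:]
        self._offsets[key] = len(self._messages[topic])
        return unread


buf = IngestionBuffer()
pushed = []
buf.subscribe("intake", pushed.append)  # push-based subscriber
buf.publish("intake", "m1")
buf.publish("intake", "m2")
first_pull = buf.pull("intake", "s1")   # pull-based subscriber, never subscribed
second_pull = buf.pull("intake", "s1")
```

Note that the pull-based subscriber `s1` receives both messages without ever having called `subscribe`.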

4.1.3 Messages Processing

On receiving a message, the streaming data processors 308, at (4), analyze the message to determine one or more rules applicable to the message. As noted above, rules maintained at the streaming data processors 308 can generally include selection criteria indicating messages to which the rule applies. The selection criteria may be formatted in the same manner or similarly to extraction rules, discussed in more detail below, and may include any number or combination of criteria based on the data included within a message or metadata of the message, such as regular expressions based on the data or metadata.

On determining that a rule is applicable to the message, the streaming data processors 308 can apply to the message one or more processing sub-rules indicated within the rule. Processing sub-rules may include modifying data or metadata of the message. Illustratively, processing sub-rules may edit or normalize data of the message (e.g., to convert a format of the data) or inject additional information into the message (e.g., retrieved based on the data of the message). For example, a processing sub-rule may specify that the data of the message be transformed according to a transformation algorithmically specified within the sub-rule. Thus, at (5), the streaming data processors 308 applies the sub-rule to transform the data of the message.

In addition or alternatively, processing sub-rules can specify a destination of the message after the message is processed at the streaming data processors 308. The destination may include, for example, a specific ingestion buffer (e.g., intake ingestion buffer 306, output ingestion buffer 310, etc.) to which the message should be published, as well as the topic on the ingestion buffer to which the message should be published. For example, a particular rule may state that messages including metrics within a first format (e.g., imperial units) should have their data transformed into a second format (e.g., metric units) and be republished to the intake ingestion buffer 306. As such, at (6), the streaming data processors 308 can determine a target ingestion buffer and topic for the transformed message based on the rule determined to apply to the message. Thereafter, the streaming data processors 308 publishes the message to the destination buffer and topic.
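A rule of the kind described above, combining selection criteria, a transforming sub-rule, and a destination, might be represented as in the following sketch. The `Rule` structure, the `miles_to_km` helper, the message format, and the buffer and topic names are all hypothetical; the imperial-to-metric conversion simply mirrors the example in the preceding paragraph.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    pattern: str                     # selection criteria (a regular expression)
    transform: Callable[[str], str]  # processing sub-rule that edits the data
    buffer: str                      # destination ingestion buffer
    topic: str                       # destination topic on that buffer

    def applies(self, data: str) -> bool:
        return re.search(self.pattern, data) is not None


def miles_to_km(data: str) -> str:
    # Rewrite every "<number> mi" as kilometres, to one decimal place.
    return re.sub(
        r"(\d+(?:\.\d+)?) mi\b",
        lambda m: f"{float(m.group(1)) * 1.609344:.1f} km",
        data,
    )


rule = Rule(r"\d+(?:\.\d+)? mi\b", miles_to_km, "intake", "metrics")
message = "distance=10 mi"
if rule.applies(message):
    message = rule.transform(message)
    destination = (rule.buffer, rule.topic)
```

Because the transformed message no longer matches the rule's selection criteria, republishing it to the intake buffer does not cause the same rule to fire again.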

For the purposes of illustration, the interactions of FIG. 6 assume that, during an initial processing of a message, the streaming data processors 308 determines (e.g., according to a rule of the data processor) that the message should be republished to the intake ingestion buffer 306, as shown at (7). The streaming data processors 308 further acknowledges the initial message to the intake ingestion buffer 306, at (8), thus indicating to the intake ingestion buffer 306 that the streaming data processors 308 has processed the initial message or published it to an intake ingestion buffer. The intake ingestion buffer 306 may be configured to maintain a message until all subscribers have acknowledged receipt of the message. Thus, transmission of the acknowledgement at (8) may enable the intake ingestion buffer 306 to delete the initial message.

It is assumed for the purposes of these illustrative interactions that at least one device implementing the streaming data processors 308 has subscribed to the topic to which the transformed message is published. Thus, the streaming data processors 308 is expected to again receive the message (e.g., as previously transformed by the streaming data processors 308), determine whether any rules apply to the message, and process the message in accordance with one or more applicable rules. In this manner, interactions (2) through (8) may occur repeatedly, as designated in FIG. 6 by the iterative processing loop 402. By use of iterative processing, the streaming data processors 308 may be configured to progressively transform or enrich messages obtained at data sources 202. Moreover, because each rule may specify only a portion of the total transformation or enrichment of a message, rules may be created without knowledge of the entire transformation. For example, a first rule may be provided by a first system to transform a message according to the knowledge of that system (e.g., transforming an error code into an error descriptor), while a second rule may process the message according to the transformation (e.g., by detecting that the error descriptor satisfies alert criteria). Thus, the streaming data processors 308 enable highly granulized processing of data without requiring an individual entity (e.g., user or system) to have knowledge of all permutations or transformations of the data.

After completion of the iterative processing loop 402, the interactions of FIG. 6 proceed to interaction (9), where the intake ingestion buffer 306 again determines subscribers of the message. The intake ingestion buffer 306, at (10), then transmits the message to the streaming data processors 308, and the streaming data processors 308 again analyze the message for applicable rules, process the message according to the rules, determine a target ingestion buffer and topic for the processed message, and acknowledge the message to the intake ingestion buffer 306, at interactions (11), (12), (13), and (15). These interactions are similar to interactions (4), (5), (6), and (8) discussed above, and therefore will not be re-described. However, in contrast to interaction (6), at interaction (13) the streaming data processors 308 may determine that a target ingestion buffer for the message is the output ingestion buffer 310. Thus, the streaming data processors 308, at (14), publishes the message to the output ingestion buffer 310, making the data of the message available to a downstream system.

FIG. 6 illustrates one processing path for data at the streaming data processors 308. However, other processing paths may occur according to embodiments of the present disclosure. For example, in some instances, a rule applicable to an initially published message on the intake ingestion buffer 306 may cause the streaming data processors 308 to publish the message to the output ingestion buffer 310 on first processing the data of the message, without entering the iterative processing loop 402. Thus, interactions (2) through (8) may be omitted.

In other instances, a single message published to the intake ingestion buffer 306 may spawn multiple processing paths at the streaming data processors 308. Illustratively, the streaming data processors 308 may be configured to maintain a set of rules, and to independently apply to a message all rules applicable to the message. Each application of a rule may spawn an independent processing path, and potentially a new message for publication to a relevant ingestion buffer. In other instances, the streaming data processors 308 may maintain a ranking of rules to be applied to messages, and may be configured to process only a highest ranked rule which applies to the message. Thus, a single message on the intake ingestion buffer 306 may result in a single message or multiple messages published by the streaming data processors 308, according to the configuration of the streaming data processors 308 in applying rules.

As noted above, the rules applied by the streaming data processors 308 may vary during operation of those processors 308. For example, the rules may be updated as user queries are received (e.g., to identify messages whose data is relevant to those queries). In some instances, rules of the streaming data processors 308 may be altered during the processing of a message, and thus the interactions of FIG. 6 may be altered dynamically during operation of the streaming data processors 308.

While the rules above are described as making various illustrative alterations to messages, various other alterations are possible within the present disclosure. For example, rules may in some instances be used to remove data from messages, or to alter the structure of the messages to conform to the format requirements of a downstream system or component. Removal of information may be beneficial, for example, where the messages include private, personal, or confidential information which is unneeded or should not be made available by a downstream system. In some instances, removal of information may include replacement of the information with a less confidential value. For example, a mailing address may be considered confidential information, whereas a postal code may not be. Thus, a rule may be implemented at the streaming data processors 308 to replace mailing addresses with a corresponding postal code, to ensure confidentiality. Various other alterations will be apparent in view of the present disclosure.
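The address-to-postal-code replacement described above could, under simplifying assumptions, be sketched as follows. The message is assumed to be a dictionary with a hypothetical `mailing_address` field, and postal codes are assumed to follow the US ZIP or ZIP+4 format; neither assumption comes from the disclosure.

```python
import re

# Assumed postal code format: US ZIP (12345) or ZIP+4 (12345-6789).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def redact_address(message: dict) -> dict:
    """Return a copy of the message in which 'mailing_address' is
    replaced by a less confidential 'postal_code' field."""
    redacted = dict(message)
    address = redacted.pop("mailing_address", None)
    if address is not None:
        matches = ZIP_RE.findall(address)
        # Use the last match so that a five-digit street number at the
        # start of the address is not mistaken for the postal code.
        redacted["postal_code"] = matches[-1] if matches else None
    return redacted


original = {"user": "a", "mailing_address": "12345 Main St, Springfield, IL 62701"}
cleaned = redact_address(original)
```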

4.1.4 Transmission to Subscribers

As discussed above, the rules applied by the streaming data processors 308 may eventually cause a message containing data from a data source 202 to be published to a topic on an output ingestion buffer 310, which topic may be specified, for example, by the rule applied by the streaming data processors 308. The output ingestion buffer 310 may thereafter make the message available to downstream systems or components. These downstream systems or components are generally referred to herein as “subscribers.” For example, the indexing system 212 may subscribe to an indexing topic 342, the query system 214 may subscribe to a search results topic 348, a client device 102 may subscribe to a custom topic 352A, etc. In accordance with the pub-sub model, the output ingestion buffer 310 may transmit each message published to a topic to each subscriber of that topic, and resiliently store the messages until acknowledged by each subscriber (or potentially until an error is logged with respect to a subscriber). As noted above, other models of communication are possible and contemplated within the present disclosure. For example, rather than subscribing to a topic on the output ingestion buffer 310 and allowing the output ingestion buffer 310 to initiate transmission of messages to the subscriber 602, the output ingestion buffer 310 may be configured to allow a subscriber 602 to query the buffer 310 for messages (e.g., unread messages, new messages since last transmission, etc.), and to initiate transmission of those messages from the buffer 310 to the subscriber 602. In some instances, such querying may remove the need for the subscriber 602 to separately “subscribe” to the topic.

Accordingly, at (16), after receiving a message to a topic, the output ingestion buffer 310 determines the subscribers to the topic (e.g., based on prior subscription requests transmitted to the output ingestion buffer 310). At (17), the output ingestion buffer 310 transmits the message to a subscriber 602. Thereafter, the subscriber may process the message at (18). Illustrative examples of such processing are described below, and may include (for example) preparation of search results for a client device 204, indexing of the data at the indexing system 212, and the like. After processing, the subscriber can acknowledge the message to the output ingestion buffer 310, thus confirming that the message has been processed at the subscriber.

4.1.5 Data Resiliency and Security

In accordance with embodiments of the present disclosure, the interactions of FIG. 6 may be ordered such that resiliency is maintained at the intake system 210. Specifically, as disclosed above, data streaming systems (which may be used to implement ingestion buffers) may implement a variety of techniques to ensure the resiliency of messages stored at such systems, absent systematic or catastrophic failures. Thus, the interactions of FIG. 6 may be ordered such that data from a data source 202 is expected or guaranteed to be included in at least one message on an ingestion system until confirmation is received that the data is no longer required.

For example, as shown in FIG. 6, interaction (8), wherein the streaming data processors 308 acknowledges receipt of an initial message at the intake ingestion buffer 306, can illustratively occur after interaction (7), wherein the streaming data processors 308 republishes the data to the intake ingestion buffer 306. Similarly, interaction (15), wherein the streaming data processors 308 acknowledges receipt of an initial message at the intake ingestion buffer 306, can illustratively occur after interaction (14), wherein the streaming data processors 308 publishes the data to the output ingestion buffer 310. This ordering of interactions can ensure, for example, that the data being processed by the streaming data processors 308 is, during that processing, always stored at the ingestion buffer 306 in at least one message. Because an ingestion buffer 306 can be configured to maintain and potentially resend messages until acknowledgement is received from each subscriber, this ordering of interactions can ensure that, should a device of the streaming data processors 308 fail during processing, another device implementing the streaming data processors 308 can later obtain the data and continue the processing.
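The publish-before-acknowledge ordering can be illustrated with a toy buffer that retains each message until it is acknowledged. The `Buffer` class, the `process` function, and the simulated failure flag are hypothetical constructs for illustration only.

```python
class Buffer:
    """Toy buffer that keeps each message until it is acknowledged."""

    def __init__(self):
        self.pending = {}   # message id -> data still held by the buffer
        self._next_id = 0

    def publish(self, data):
        self.pending[self._next_id] = data
        self._next_id += 1

    def ack(self, message_id):
        self.pending.pop(message_id, None)


def process(buffer, message_id, transform, crash_before_publish=False):
    data = buffer.pending[message_id]
    if crash_before_publish:
        # Simulated processor failure: nothing published, nothing acknowledged.
        raise RuntimeError("processor failed")
    buffer.publish(transform(data))  # republish first, as at interaction (7)
    buffer.ack(message_id)           # acknowledge afterwards, as at interaction (8)


buf = Buffer()
buf.publish("raw")
try:
    process(buf, 0, str.upper, crash_before_publish=True)
except RuntimeError:
    pass
after_crash = dict(buf.pending)   # the original message is still held
process(buf, 0, str.upper)        # another processor retries successfully
after_retry = dict(buf.pending)
```

At every point in this sketch the data exists in at least one pending message: the original survives the failure, and it is removed only after its transformed successor has been published.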

Similarly, as shown in FIG. 6, each subscriber 602 may be configured to acknowledge a message to the output ingestion buffer 310 after processing for the message is completed. In this manner, should a subscriber 602 fail after receiving a message but prior to completing processing of the message, the processing of the subscriber 602 can be restarted to successfully process the message. Thus, the interactions of FIG. 6 can maintain resiliency of data on the intake system 210 commensurate with the resiliency provided by an individual ingestion buffer 306.

While message acknowledgement is described herein as an illustrative mechanism to ensure data resiliency at an intake system 210, other mechanisms for ensuring data resiliency may additionally or alternatively be used.

As will be appreciated in view of the present disclosure, the configuration and operation of the intake system 210 can further provide high amounts of security to the messages of that system. Illustratively, the intake ingestion buffer 306 or output ingestion buffer 310 may maintain an authorization record indicating specific devices or systems with authorization to publish or subscribe to a specific topic on the ingestion buffer. As such, an ingestion buffer may ensure that only authorized parties are able to access sensitive data. In some instances, this security may enable multiple entities to utilize the intake system 210 to manage confidential information, with little or no risk of that information being shared between the entities. The managing of data or processing for multiple entities is in some instances referred to as “multi-tenancy.”

Illustratively, a first entity may publish messages to a first topic on the intake ingestion buffer 306, and the intake ingestion buffer 306 may verify that any intake point or data source 202 publishing to that first topic be authorized by the first entity to do so. The streaming data processors 308 may maintain rules specific to the first entity, which the first entity may illustratively provide through an authenticated session on an interface (e.g., GUI, API, command line interface (CLI), etc.). The rules of the first entity may specify one or more entity-specific topics on the output ingestion buffer 310 to which messages containing data of the first entity should be published by the streaming data processors 308. The output ingestion buffer 310 may maintain authorization records for such entity-specific topics, thus restricting messages of those topics to parties authorized by the first entity. In this manner, data security for the first entity can be ensured across the intake system 210. Similar operations may be performed for other entities, thus allowing multiple entities to separately and confidentially publish data to and retrieve data from the intake system.
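An authorization record of the kind described above might, in its simplest form, be a set of per-topic grants. The `TopicAuthorizer` class and the principal and topic names below are hypothetical; a real deployment would also need authentication, which this sketch does not attempt to model.

```python
class TopicAuthorizer:
    """Minimal per-topic authorization record for an ingestion buffer."""

    ACTIONS = ("publish", "subscribe")

    def __init__(self):
        self._grants = set()  # (principal, topic, action) triples

    def grant(self, principal, topic, action):
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self._grants.add((principal, topic, action))

    def is_allowed(self, principal, topic, action):
        return (principal, topic, action) in self._grants


auth = TopicAuthorizer()
# Hypothetical grants made on behalf of a first entity.
auth.grant("entity1-forwarder", "entity1-intake", "publish")
auth.grant("entity1-indexer", "entity1-output", "subscribe")
```

Any request that is not covered by an explicit grant is denied, so a second entity's devices cannot read the first entity's topics by default.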

4.1.6 Message Processing Algorithm

With reference to FIG. 7, an illustrative algorithm or routine for processing messages at the intake system 210 will be described in the form of a flowchart. The routine begins at block 702, where the intake system 210 obtains one or more rules for handling messages enqueued at an intake ingestion buffer 306. As noted above, the rules may, for example, be human-generated, or may be automatically generated based on operation of the data intake and query system 108 (e.g., in response to user submission of a query to the system 108).

At block 704, the intake system 210 obtains a message at the intake ingestion buffer 306. The message may be published to the intake ingestion buffer 306, for example, by the data retrieval subsystem 304 (e.g., working in conjunction with a forwarder 302) and reflect data obtained from a data source 202.

At block 706, the intake system 210 determines whether any obtained rule applies to the message. Illustratively, the intake system 210 (e.g., via the streaming data processors 308) may apply selection criteria of each rule to the message to determine whether the message satisfies the selection criteria. Thereafter, the routine varies according to whether a rule applies to the message. If no rule applies, the routine can continue to block 714, where the intake system 210 transmits an acknowledgement for the message to the intake ingestion buffer 306, thus enabling the buffer 306 to discard the message (e.g., once all other subscribers have acknowledged the message). In some variations of the routine, a “default rule” may be applied at the intake system 210, such that all messages are processed at least according to the default rule. The default rule may, for example, forward the message to an indexing topic 342 for processing by an indexing system 212. In such a configuration, block 706 may always evaluate as true.

In the instance that at least one rule is determined to apply to the message, the routine continues to block 708, where the intake system 210 (e.g., via the streaming data processors 308) transforms the message as specified by the applicable rule. For example, a processing sub-rule of the applicable rule may specify that data or metadata of the message be converted from one format to another via an algorithmic transformation. As such, the intake system 210 may apply the algorithmic transformation to the data or metadata of the message at block 708 to transform the data or metadata of the message. In some instances, no transformation may be specified within intake system 210, and thus block 708 may be omitted.

At block 710, the intake system 210 determines a destination ingestion buffer to which to publish the (potentially transformed) message, as well as a topic to which the message should be published. The destination ingestion buffer and topic may be specified, for example, in processing sub-rules of the rule determined to apply to the message. In one embodiment, the destination ingestion buffer and topic may vary according to the data or metadata of the message. In another embodiment, the destination ingestion buffer and topic may be fixed with respect to a particular rule.

At block 712, the intake system 210 publishes the (potentially transformed) message to the determined destination ingestion buffer and topic. The determined destination ingestion buffer may be, for example, the intake ingestion buffer 306 or the output ingestion buffer 310. Thereafter, at block 714, the intake system 210 acknowledges the initial message on the intake ingestion buffer 306, thus enabling the intake ingestion buffer 306 to delete the message.

Thereafter, the routine returns to block 704, where the intake system 210 continues to process messages from the intake ingestion buffer 306. Because the destination ingestion buffer determined during a prior implementation of the routine may be the intake ingestion buffer 306, the routine may continue to process the same underlying data within multiple messages published on that buffer 306 (thus implementing an iterative processing loop with respect to that data). The routine may then continue to be implemented during operation of the intake system 210, such that data published to the intake ingestion buffer 306 is processed by the intake system 210 and made available on an output ingestion buffer 310 to downstream systems or components.
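The routine of blocks 704 through 714 can be sketched, in simplified form, as a loop over a queue. The sketch below assumes a single intake queue without topics, applies only the first matching rule, and represents each rule as a hypothetical `(applies, transform, destination)` tuple; the error-code rules are invented to mirror the iterative example discussed earlier. Rules must eventually route data away from the intake queue, or the loop would not terminate.

```python
from collections import deque


def run_intake(intake, rules):
    """Process messages until the intake queue is empty.
    Returns (messages published to output, messages acknowledged)."""
    output, acked = [], []
    while intake:
        message = intake[0]                      # block 704: obtain a message
        for applies, transform, destination in rules:
            if applies(message):                 # block 706: does a rule apply?
                result = transform(message)      # block 708: transform
                # Blocks 710 and 712: choose the destination and publish.
                target = intake if destination == "intake" else output
                target.append(result)
                break
        # Block 714: acknowledge, allowing the buffer to delete the message.
        acked.append(intake.popleft())
    return output, acked


rules = [
    # First rule: translate an error code into a descriptor, then republish.
    (lambda m: "code=" in m,
     lambda m: m.replace("code=404", "error=not_found"),
     "intake"),
    # Second rule: flag messages carrying a descriptor and send them onward.
    (lambda m: "error=" in m,
     lambda m: m + " alert=true",
     "output"),
]
output, acked = run_intake(deque(["code=404", "hello"]), rules)
```

The message `code=404` passes through the loop twice, once per rule, while `hello` matches no rule and is simply acknowledged and discarded.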

While the routine of FIG. 7 is described linearly, various implementations may involve concurrent or at least partially parallel processing. For example, in one embodiment, the intake system 210 is configured to process a message according to all rules determined to apply to that message. Thus, for example, if at block 706 five rules are determined to apply to the message, the intake system 210 may implement five instances of blocks 708 through 714, each of which may transform the message in different ways or publish the message to different ingestion buffers or topics. These five instances may be implemented in serial, parallel, or a combination thereof. Thus, the linear description of FIG. 7 is intended simply for illustrative purposes.

While the routine of FIG. 7 is described with respect to a single message, in some embodiments streaming data processors 308 may be configured to process multiple messages concurrently or as a batch. Similarly, all or a portion of the rules used by the streaming data processors 308 may apply to sets or batches of messages. Illustratively, the streaming data processors 308 may obtain a batch of messages from the intake ingestion buffer 306 and process those messages according to a set of “batch” rules, whose criteria and/or processing sub-rules apply to the messages of the batch collectively. Such rules may, for example, determine aggregate attributes of the messages within the batch, sort messages within the batch, group subsets of messages within the batch, and the like. In some instances, such rules may further alter messages based on aggregate attributes, sorting, or groupings. For example, a rule may select the third message within a batch, and perform a specific operation on that message. As another example, a rule may determine how many messages within a batch are contained within a specific group of messages. Various other examples for batch-based rules will be apparent in view of the present disclosure. Batches of messages may be determined based on a variety of criteria. For example, the streaming data processors 308 may batch messages based on a threshold number of messages (e.g., each thousand messages), based on timing (e.g., all messages received over a ten minute window), or based on other criteria (e.g., the lack of new messages posted to a topic within a threshold period of time).
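As a non-limiting sketch, batching by a threshold number of messages or by a time window, followed by a batch rule that selects the third message, could look like the following. The function names, the (timestamp, body) tuple layout, and the parameter values are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Message = Tuple[float, str]  # (receive time in seconds, body); a hypothetical layout


def batch_messages(messages: Iterable[Message], max_count: int,
                   max_window: float) -> Iterator[List[Message]]:
    """Close a batch when it holds max_count messages or spans max_window seconds."""
    batch: List[Message] = []
    for timestamp, body in messages:
        if batch and (len(batch) >= max_count or timestamp - batch[0][0] >= max_window):
            yield batch
            batch = []
        batch.append((timestamp, body))
    if batch:
        yield batch  # flush whatever remains when the input ends


def third_message(batch: List[Message]) -> Optional[Message]:
    """Example batch rule: select the third message of a batch, if there is one."""
    return batch[2] if len(batch) >= 3 else None


stream = [(0.0, "a"), (1.0, "b"), (2.0, "c"), (3.0, "d"), (700.0, "e")]
batches = list(batch_messages(stream, max_count=3, max_window=600.0))
```

Here the first batch closes on the count threshold and the second closes because the next message arrives more than ten minutes after the batch began.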

4.2. Indexing

FIG. 8 is a data flow diagram illustrating an embodiment of the data flow and communications between a variety of the components of the data intake and query system 108 during indexing. Specifically, FIG. 8 is a data flow diagram illustrating an embodiment of the data flow and communications between an ingestion buffer 310, an indexing node manager 406 or partition manager 408, an indexer 410, common storage 216, and the data store catalog 220. However, it will be understood that, in some embodiments, one or more of the functions described herein with respect to FIG. 8 can be omitted, performed in a different order and/or performed by a different component of the data intake and query system 108. Accordingly, the illustrated embodiment and description should not be construed as limiting.

At (1), the indexing node manager 406 activates a partition manager 408 for a partition. As described herein, the indexing node manager 406 can activate a partition manager 408 for each partition or shard that is processed by an indexing node 404. In some embodiments, the indexing node manager 406 can activate the partition manager 408 based on an assignment of a new partition to the indexing node 404 or a partition manager 408 becoming unresponsive or unavailable, etc.

In some embodiments, the partition manager 408 can be a copy of the indexing node manager 406 or a copy of a template process. In certain embodiments, the partition manager 408 can be instantiated in a separate container from the indexing node manager 406.

At (2), the ingestion buffer 310 sends data and a buffer location to the indexing node 404. As described herein, the data can be raw machine data, performance metrics data, correlation data, JSON blobs, XML data, data in a datamodel, report data, tabular data, streaming data, data exposed in an API, data in a relational database, etc. The buffer location can correspond to a marker in the ingestion buffer 310 that indicates the point at which the data within a partition has been communicated to the indexing node 404. For example, data before the marker can correspond to data that has not been communicated to the indexing node 404, and data after the marker can correspond to data that has been communicated to the indexing node 404. In some cases, the marker can correspond to a set of data that has been communicated to the indexing node 404, but for which no indication has been received that the data has been stored. Accordingly, based on the marker, the ingestion buffer 310 can retain a portion of its data persistently until it receives confirmation that the data can be deleted or has been stored in common storage 216.

At (3), the indexing node manager 406 tracks the buffer location and the partition manager 408 communicates the data to the indexer 410. As described herein, the indexing node manager 406 can track (and/or store) the buffer location for the various partitions received from the ingestion buffer 310. In addition, as described herein, the partition manager 408 can forward the data received from the ingestion buffer 310 to the indexer 410 for processing. In various implementations, as previously described, the data from ingestion buffer 310 that is sent to the indexer 410 may include a path to stored data, e.g., data stored in common storage 216 or another common store, which is then retrieved by the indexer 410 or another component of the indexing node 404.

At (4), the indexer 410 processes the data. As described herein, the indexer 410 can perform a variety of functions, enrichments, or transformations on the data as it is indexed. For example, the indexer 410 can parse the data, identify events from the data, identify and associate timestamps with the events, associate metadata or one or more field values with the events, group events (e.g., based on time, partition, and/or tenant ID, etc.), etc. Furthermore, the indexer 410 can generate buckets based on a bucket creation policy and store the events in the hot buckets, which may be stored in data store 412 of the indexing node 404 associated with that indexer 410 (see FIG. 4).

At (5), the indexer 410 reports the size of the data being indexed to the partition manager 408. In some cases, the indexer 410 can routinely provide a status update to the partition manager 408 regarding the data that is being processed by the indexer 410.

The status update can include, but is not limited to, the size of the data, the number of buckets being created, the amount of time since the buckets have been created, etc. In some embodiments, the indexer 410 can provide the status update based on one or more thresholds being satisfied (e.g., one or more threshold sizes being satisfied by the amount of data being processed, one or more timing thresholds being satisfied based on the amount of time the buckets have been created, one or more bucket number thresholds based on the number of buckets created, the number of hot or warm buckets, number of buckets that have not been stored in common storage 216, etc.).

In certain cases, the indexer 410 can provide an update to the partition manager 408 regarding the size of the data that is being processed by the indexer 410 in response to one or more threshold sizes being satisfied. For example, each time a certain amount of data is added to the indexer 410 (e.g., 5 MB, 10 MB, etc.), the indexer 410 can report the updated size to the partition manager 408. In some cases, the indexer 410 can report the size of the data stored thereon to the partition manager 408 once a threshold size is satisfied.
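One possible way to report the size each time a certain amount of data has been added is sketched below. The SizeReportingIndexer class, its callback, and the 5 MB step are illustrative assumptions; the report callback stands in for whatever channel carries the update to the partition manager 408.

```python
from typing import Callable

MB = 1024 * 1024


class SizeReportingIndexer:
    """Hypothetical indexer stand-in that reports its size at fixed increments."""

    def __init__(self, report: Callable[[int], None], step_bytes: int) -> None:
        self._report = report           # e.g., a call into the partition manager
        self._step = step_bytes         # report each time this much data is added
        self._next_report = step_bytes  # size at which the next report is due
        self.size = 0

    def add(self, data: bytes) -> None:
        self.size += len(data)
        if self.size >= self._next_report:
            self._report(self.size)
            # Schedule the next report at the next multiple of the step above the current size.
            self._next_report = (self.size // self._step + 1) * self._step


reports = []
indexer = SizeReportingIndexer(reports.append, step_bytes=5 * MB)
for _ in range(12):
    indexer.add(b"x" * MB)
```

With twelve 1 MB additions and a 5 MB step, the sketch reports twice: once at 5 MB and once at 10 MB.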

In certain embodiments, the indexer 410 reports the size of the data being indexed to the partition manager 408 based on a query by the partition manager 408. In certain embodiments, the indexer 410 and partition manager 408 maintain an open communication link such that the partition manager 408 is persistently aware of the amount of data on the indexer 410.

In some cases, a partition manager 408 monitors the data processed by the indexer 410. For example, the partition manager 408 can track the size of the data on the indexer 410 that is associated with the partition being managed by the partition manager 408. In certain cases, one or more partition managers 408 can track the amount or size of the data on the indexer 410 that is associated with any partition being managed by the indexing node manager 406 or that is associated with the indexing node 404.

At (6), the partition manager 408 instructs the indexer 410 to copy the data to common storage 216. As described herein, the partition manager 408 can instruct the indexer 410 to copy the data to common storage 216 based on a bucket roll-over policy. As described herein, in some cases, the bucket roll-over policy can indicate that one or more buckets are to be rolled over based on size. Accordingly, in some embodiments, the partition manager 408 can instruct the indexer 410 to copy the data to common storage 216 based on a determination that the amount of data stored on the indexer 410 satisfies a threshold amount. The threshold amount can correspond to the amount of data associated with the partition that is managed by the partition manager 408 or the amount of data being processed by the indexer 410 for any partition.

In some cases, the partition manager 408 can instruct the indexer 410 to copy the data that corresponds to the partition being managed by the partition manager 408 to common storage 216 based on the size of the data that corresponds to the partition satisfying the threshold amount. In certain embodiments, the partition manager 408 can instruct the indexer 410 to copy the data associated with any partition being processed by the indexer 410 to common storage 216 based on the amount of the data from the partitions that are being processed by the indexer 410 satisfying the threshold amount.
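The two threshold scopes described above (the data of a single partition versus the data of all partitions on the indexer) can be contrasted in a short sketch. The function name, the scope labels, and the byte counts are hypothetical.

```python
from typing import Dict, List


def partitions_to_roll(sizes: Dict[str, int], threshold: int,
                       scope: str = "partition") -> List[str]:
    """Return the partitions whose data should be copied to common storage.

    scope="partition": a partition rolls over when its own data meets the threshold.
    scope="indexer":   every partition rolls over when the total across partitions does.
    """
    if scope == "partition":
        return sorted(p for p, size in sizes.items() if size >= threshold)
    if scope == "indexer":
        return sorted(sizes) if sum(sizes.values()) >= threshold else []
    raise ValueError(f"unknown scope: {scope!r}")


sizes = {"p1": 600, "p2": 300}  # bytes of data per partition on one indexer
per_partition = partitions_to_roll(sizes, threshold=750, scope="partition")
whole_indexer = partitions_to_roll(sizes, threshold=750, scope="indexer")
```

With these figures neither partition reaches 750 bytes on its own, but the combined 900 bytes do, so only the indexer-wide scope triggers a copy.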

In some embodiments, (5) and/or (6) can be omitted. For example, the indexer 410 can monitor the data stored thereon. Based on the bucket roll-over policy, the indexer 410 can determine that the data is to be copied to common storage 216. Accordingly, in some embodiments, the indexer 410 can determine that the data is to be copied to common storage 216 without communication with the partition manager 408.

At (7), the indexer 410 copies and/or stores the data to common storage 216. As described herein, in some cases, as the indexer 410 processes the data, it generates events and stores the events in hot buckets. In response to receiving the instruction to move the data to common storage 216, the indexer 410 can convert the hot buckets to warm buckets, and copy or move the warm buckets to the common storage 216.

As part of storing the data to common storage 216, the indexer 410 can verify or obtain acknowledgements that the data is stored successfully. In some embodiments, the indexer 410 can determine information regarding the data stored in the common storage 216. For example, the information can include location information regarding the data that was stored to the common storage 216, bucket identifiers of the buckets that were copied to common storage 216, as well as additional information, e.g., in implementations in which the ingestion buffer 310 uses sequences of records as the form for data storage, the list of record sequence numbers that were used as part of those buckets that were copied to common storage 216.

At (8), the indexer 410 reports or acknowledges to the partition manager 408 that the data is stored in the common storage 216. In various implementations, this can be in response to periodic requests from the partition manager 408 to the indexer 410 regarding which buckets and/or data have been stored to common storage 216. The indexer 410 can provide the partition manager 408 with information regarding the data stored in common storage 216 similar to the data that is provided to the indexer 410 by the common storage 216. In some cases, (8) can be replaced with the common storage 216 acknowledging or reporting the storage of the data to the partition manager 408.

At (9), the partition manager 408 updates the data store catalog 220. As described herein, the partition manager 408 can update the data store catalog 220 with information regarding the data or buckets stored in common storage 216. For example, the partition manager 408 can update the data store catalog 220 to include location information, a bucket identifier, a time range, and tenant and partition information regarding the buckets copied to common storage 216, etc. In this way, the data store catalog 220 can include up-to-date information regarding the buckets stored in common storage 216.

At (10), the partition manager 408 reports the completion of the storage to the ingestion buffer 310, and at (11), the ingestion buffer 310 updates the buffer location or marker. Accordingly, in some embodiments, the ingestion buffer 310 can maintain its marker until it receives an acknowledgement that the data that it sent to the indexing node 404 has been indexed by the indexing node 404 and stored to common storage 216. In addition, the updated buffer location or marker can be communicated to and stored by the indexing node manager 406. In this way, a data intake and query system 108 can use the ingestion buffer 310 to provide a stateless environment for the indexing system 212. For example, as described herein, if an indexing node 404 or one of its components (e.g., indexing node manager 406, partition manager 408, indexer 410) becomes unavailable or unresponsive before data from the ingestion buffer 310 is copied to common storage 216, the indexing system 212 can generate or assign a new indexing node 404 (or component), to process the data that was assigned to the now unavailable indexing node 404 (or component) while reducing, minimizing, or eliminating data loss.
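The marker handling at (2), (10), and (11) can be sketched as a buffer that retains records until their storage is acknowledged. The IngestionBuffer class, its method names, and the use of integer offsets for the marker are assumptions made for illustration; an actual ingestion buffer 310 may track its location differently.

```python
from typing import List, Tuple


class IngestionBuffer:
    """Hypothetical buffer that keeps records until their storage is confirmed."""

    def __init__(self) -> None:
        self._unacked: List[str] = []  # records not yet confirmed as stored
        self.marker = 0                # offset of the first unconfirmed record

    def publish(self, record: str) -> None:
        self._unacked.append(record)

    def read(self) -> Tuple[int, List[str]]:
        """Send the current marker and every record at or after it."""
        return self.marker, list(self._unacked)

    def acknowledge(self, stored_through: int) -> None:
        """Advance the marker once records before stored_through are in common storage."""
        count = stored_through - self.marker
        if count < 0 or count > len(self._unacked):
            raise ValueError("acknowledgement outside the retained range")
        del self._unacked[:count]  # confirmed records can now be deleted
        self.marker = stored_through


buffer = IngestionBuffer()
for record in ("e1", "e2", "e3"):
    buffer.publish(record)

first_read = buffer.read()   # the indexing node receives data and marker
second_read = buffer.read()  # a replacement node after a failure sees the same data
buffer.acknowledge(2)        # the first two records are confirmed as stored
third_read = buffer.read()
```

Because the marker moves only on acknowledgement, a replacement indexing node that reads before any acknowledgement receives exactly the records the failed node was given, which is the property the stateless arrangement relies on.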

At (12), a bucket manager 414, which may form part of the indexer 410, the indexing node 404, or indexing system 212, merges multiple buckets into one or more merged buckets. As described herein, to reduce delay between processing data and making that data available for searching, the indexer 410 can convert smaller hot buckets to warm buckets and copy the warm buckets to common storage 216. However, as smaller buckets in common storage 216 can result in increased overhead and storage costs, the bucket manager 414 can monitor warm buckets in the indexer 410 and merge the warm buckets into one or more merged buckets.

In some cases, the bucket manager 414 can merge the buckets according to a bucket merge policy. As described herein, the bucket merge policy can indicate which buckets are candidates for a merge (e.g., based on time ranges, size, tenant/partition or other identifiers, etc.), the number of buckets to merge, size or time range parameters for the merged buckets, a frequency for creating the merged buckets, etc.
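A simple bucket merge policy of this kind might be sketched as follows. The Bucket fields, the plan_merges function, and the specific criteria (a size cutoff for candidates and a minimum group size per tenant and partition) are illustrative assumptions, not the policy of any particular embodiment.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Bucket:
    bucket_id: str
    tenant: str
    partition: str
    start: int  # earliest event time in the bucket
    end: int    # latest event time in the bucket
    size: int   # bytes


def plan_merges(buckets: List[Bucket], small_bucket_bytes: int,
                min_group: int) -> List[Tuple[Bucket, List[Bucket]]]:
    """Group small buckets by (tenant, partition) and describe each merged bucket."""
    groups: Dict[Tuple[str, str], List[Bucket]] = defaultdict(list)
    for bucket in buckets:
        if bucket.size < small_bucket_bytes:  # only small buckets are candidates
            groups[(bucket.tenant, bucket.partition)].append(bucket)

    plans = []
    for (tenant, partition), members in sorted(groups.items(), key=lambda kv: kv[0]):
        if len(members) < min_group:
            continue  # too few candidates to be worth merging
        members = sorted(members, key=lambda b: b.start)
        merged = Bucket(
            bucket_id="merged:" + "+".join(b.bucket_id for b in members),
            tenant=tenant,
            partition=partition,
            start=min(b.start for b in members),
            end=max(b.end for b in members),
            size=sum(b.size for b in members),
        )
        plans.append((merged, members))
    return plans


warm = [
    Bucket("b1", "acme", "main", 0, 59, 10),
    Bucket("b2", "acme", "main", 60, 119, 20),
    Bucket("b3", "acme", "web", 0, 59, 15),
    Bucket("b4", "acme", "main", 120, 179, 500),
]
plans = plan_merges(warm, small_bucket_bytes=100, min_group=2)
```

Each plan pairs the merged bucket's metadata with the pre-merged buckets it replaces, which is the information later needed to update the catalog and delete the originals.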

At (13), the bucket manager 414 stores and/or copies the merged data or buckets to common storage 216, and obtains information about the merged buckets stored in common storage 216. Similar to (7), the obtained information can include information regarding the storage of the merged buckets, such as, but not limited to, the location of the buckets, one or more bucket identifiers, tenant or partition identifiers, etc. At (14), the bucket manager 414 reports the storage of the merged data to the partition manager 408, similar to the reporting of the data storage at (8).

At (15), the indexer 410 deletes data from the data store (e.g., data store 412). As described herein, once the merged buckets have been stored in common storage 216, the indexer 410 can delete corresponding buckets that it has stored locally. For example, the indexer 410 can delete the merged buckets from the data store 412, as well as the pre-merged buckets (buckets used to generate the merged buckets). By removing the data from the data store 412, the indexer 410 can free up additional space for additional hot buckets, warm buckets, and/or merged buckets.

At (16), the common storage 216 deletes data according to a bucket management policy. As described herein, once the merged buckets have been stored in common storage 216, the common storage 216 can delete the pre-merged buckets stored therein. In some cases, as described herein, the common storage 216 can delete the pre-merged buckets immediately, after a predetermined amount of time, after one or more queries relying on the pre-merged buckets have completed, or based on other criteria in the bucket management policy, etc. In certain embodiments, a controller at the common storage 216 handles the deletion of the data in common storage 216 according to the bucket management policy. In certain embodiments, one or more components of the indexing node 404 delete the data from common storage 216 according to the bucket management policy. However, for simplicity, reference is made to common storage 216 performing the deletion.

At (17), the partition manager 408 updates the data store catalog 220 with the information about the merged buckets. Similar to (9), the partition manager 408 can update the data store catalog 220 with the merged bucket information. The information can include, but is not limited to, the time range of the merged buckets, location of the merged buckets in common storage 216, a bucket identifier for the merged buckets, tenant and partition information of the merged buckets, etc. In addition, as part of updating the data store catalog 220, the partition manager 408 can remove reference to the pre-merged buckets. Accordingly, the data store catalog 220 can be revised to include information about the merged buckets and omit information about the pre-merged buckets. In this way, as the search managers 514 request information about buckets in common storage 216 from the data store catalog 220, the data store catalog 220 can provide the search managers 514 with the merged bucket information.
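The catalog updates at (9) and (17) can be illustrated with an in-memory stand-in. The DataStoreCatalog class below, its method names, and the location strings are hypothetical; they show only the bookkeeping of adding a merged bucket and dropping references to its pre-merged buckets.

```python
from typing import Dict, Iterable, List, Tuple


class DataStoreCatalog:
    """Hypothetical in-memory catalog of buckets held in common storage."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    def add(self, bucket_id: str, location: str, time_range: Tuple[int, int],
            tenant: str, partition: str) -> None:
        self._entries[bucket_id] = {
            "location": location,
            "time_range": time_range,
            "tenant": tenant,
            "partition": partition,
        }

    def replace_with_merged(self, pre_merged_ids: Iterable[str], bucket_id: str,
                            location: str, time_range: Tuple[int, int],
                            tenant: str, partition: str) -> None:
        """Record the merged bucket, then remove references to the pre-merged ones."""
        self.add(bucket_id, location, time_range, tenant, partition)
        for old_id in pre_merged_ids:
            self._entries.pop(old_id, None)

    def find(self, tenant: str, start: int, end: int) -> List[str]:
        """Return bucket identifiers for a tenant whose time range overlaps [start, end]."""
        return sorted(
            bucket_id
            for bucket_id, entry in self._entries.items()
            if entry["tenant"] == tenant
            and entry["time_range"][0] <= end
            and entry["time_range"][1] >= start
        )


catalog = DataStoreCatalog()
catalog.add("b1", "store/acme/main/b1", (0, 59), "acme", "main")
catalog.add("b2", "store/acme/main/b2", (60, 119), "acme", "main")
before = catalog.find("acme", 0, 200)
catalog.replace_with_merged(["b1", "b2"], "m1", "store/acme/main/m1", (0, 119), "acme", "main")
after = catalog.find("acme", 0, 200)
```

A lookup made after the update returns only the merged bucket, so a search covering the same time range reads one bucket instead of two.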

As mentioned previously, in some embodiments, one or more of the functions described herein with respect to FIG. 8 can be omitted, performed in a variety of orders and/or performed by a different component of the data intake and query system 108. For example, the partition manager 408 can (9) update the data store catalog 220 before, after, or concurrently with the deletion of the data in the (15) indexer 410 or (16) common storage 216. Similarly, in certain embodiments, the indexer 410 can (12) merge buckets before, after, or concurrently with (7)-(11), etc.

4.2.1. Containerized Indexing Nodes

FIG. 9 is a flow diagram illustrative of an embodiment of a routine 900 implemented by the indexing system 212 to store data in common storage 216. Although described as being implemented by the indexing system 212, it will be understood that the elements outlined for routine 900 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the indexing manager 402, the indexing node 404, indexing node manager 406, the partition manager 408, the indexer 410, the bucket manager 414, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 902, the indexing system 212 receives data. As described herein, the indexing system 212 can receive data from a variety of sources in various formats. For example, as described herein, the data received can be machine data, performance metrics, correlated data, etc.

At block 904, the indexing system 212 stores the data in buckets using one or more containerized indexing nodes 404. As described herein, the indexing system 212 can include multiple containerized indexing nodes 404 to receive and process the data. The containerized indexing nodes 404 can enable the indexing system 212 to provide a highly extensible and dynamic indexing service. For example, based on resource availability and/or workload, the indexing system 212 can instantiate additional containerized indexing nodes 404 or terminate containerized indexing nodes 404. Further, multiple containerized indexing nodes 404 can be instantiated on the same computing device, and share the resources of the computing device.

As described herein, each indexing node 404 can be implemented using containerization or operating-system-level virtualization, or another virtualization technique. For example, the indexing node 404, or one or more components of the indexing node 404, can be implemented as separate containers or container instances. Each container instance can have certain resources (e.g., memory, processor, etc.) of the underlying computing system assigned to it, but may share the same operating system and may use the operating system's system call interface. Further, each container may run the same or different computer applications concurrently or separately, and may interact with each other. It will be understood that other virtualization techniques can be used. For example, the containerized indexing nodes 404 can be implemented using virtual machines using full virtualization or paravirtualization, etc.

In some embodiments, the indexing node 404 can be implemented as a group of related containers or a pod, and the various components of the indexing node 404 can be implemented as related containers of a pod. Further, the indexing node 404 can assign different containers to execute different tasks. For example, one container of a containerized indexing node 404 can receive the incoming data and forward it to a second container for processing, etc. The second container can generate buckets for the data, store the data in buckets, and communicate the buckets to common storage 216. A third container of the containerized indexing node 404 can merge the buckets into merged buckets and store the merged buckets in common storage 216. However, it will be understood that the containerized indexing node 404 can be implemented in a variety of configurations. For example, in some cases, the containerized indexing node 404 can be implemented as a single container and can include multiple processes to implement the tasks described above by the three containers. Any combination of containerization and processes can be used to implement the containerized indexing node 404 as desired.

In some embodiments, the containerized indexing node 404 processes the received data (or the data obtained using the received data) and stores it in buckets. As part of the processing, the containerized indexing node 404 can determine information about the data (e.g., host, source, sourcetype), extract or identify timestamps, associate metadata fields with the data, extract keywords, transform the data, identify and organize the data into events having raw machine data associated with a timestamp, etc. In some embodiments, the containerized indexing node 404 uses one or more configuration files and/or extraction rules to extract information from the data or events.

In addition, as part of processing and storing the data, the containerized indexing node 404 can generate buckets for the data according to a bucket creation policy. As described herein, the containerized indexing node 404 can concurrently generate and fill multiple buckets with the data that it processes. In some embodiments, the containerized indexing node 404 generates buckets for each partition or tenant associated with the data that is being processed. In certain embodiments, the indexing node 404 stores the data or events in the buckets based on the identified timestamps.

Furthermore, the containerized indexing node 404 can generate one or more indexes associated with the buckets, such as, but not limited to, one or more inverted indexes, TSIDXs, keyword indexes, etc. The data and the indexes can be stored in one or more files of the buckets. In addition, the indexing node 404 can generate additional files for the buckets, such as, but not limited to, one or more filter files, a bucket summary, or manifest, etc.

At block 906, the indexing node 404 stores buckets in common storage 216. As described herein, in certain embodiments, the indexing node 404 stores the buckets in common storage 216 according to a bucket roll-over policy. In some cases, the buckets are stored in common storage 216 in one or more directories based on an index/partition or tenant associated with the buckets. Further, the buckets can be stored in a time series manner to facilitate time series searching as described herein. Additionally, as described herein, the common storage 216 can replicate the buckets across multiple tiers and data stores across one or more geographical locations.

Fewer, more, or different blocks can be used as part of the routine 900. In some cases, one or more blocks can be omitted. For example, in some embodiments, the containerized indexing node 404 or an indexing system manager 402 can monitor the amount of data received by the indexing system 212. Based on the amount of data received and/or a workload or utilization of the containerized indexing node 404, the indexing system 212 can instantiate an additional containerized indexing node 404 to process the data.
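As one hedged illustration of such a scaling decision, the number of containerized indexing nodes could be derived from the observed ingest rate. The desired_node_count function, the utilization target, and the rate units are assumptions; the disclosure does not prescribe a particular formula.

```python
import math


def desired_node_count(ingest_rate: float, per_node_capacity: float,
                       target_utilization: float = 0.75, min_nodes: int = 1) -> int:
    """Estimate how many indexing nodes keep each node near the target utilization.

    ingest_rate and per_node_capacity share a unit (e.g., MB per second).
    """
    if per_node_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization within (0, 1]")
    needed = math.ceil(ingest_rate / (per_node_capacity * target_utilization))
    return max(min_nodes, needed)


current_nodes = 2
wanted = desired_node_count(ingest_rate=900.0, per_node_capacity=400.0)
to_instantiate = max(0, wanted - current_nodes)
```

In this sketch an ingest rate of 900 against a per-node capacity of 400 at a 75% target calls for three nodes, so one additional node would be instantiated.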

In some cases, the containerized indexing node 404 can instantiate a container or process to manage the processing and storage of data from an additional shard or partition of data received from the intake system. For example, as described herein, the containerized indexing node 404 can instantiate a partition manager 408 for each partition or shard of data that is processed by the containerized indexing node 404.

In certain embodiments, the indexing node 404 can delete locally stored buckets. For example, once the buckets are stored in common storage 216, the indexing node 404 can delete the locally stored buckets. In this way, the indexing node 404 can reduce the amount of data stored thereon.

As described herein, the indexing node 404 can merge buckets and store merged buckets in the common storage 216. In some cases, as part of merging and storing buckets in common storage 216, the indexing node 404 can delete locally stored pre-merged buckets (buckets used to generate the merged buckets) and/or the merged buckets or can instruct the common storage 216 to delete the pre-merged buckets. In this way, the indexing node 404 can reduce the amount of data stored in the indexing node 404 and/or the amount of data stored in common storage 216.

In some embodiments, the indexing node 404 can update a data store catalog 220 with information about pre-merged or merged buckets stored in common storage 216. As described herein, the information can identify the location of the buckets in common storage 216 and other information, such as, but not limited to, a partition or tenant associated with the bucket, time range of the bucket, etc. As described herein, the information stored in the data store catalog 220 can be used by the query system 214 to identify buckets to be searched as part of a query.

Furthermore, it will be understood that the various blocks described herein with reference to FIG. 9 can be implemented in a variety of orders, or can be performed concurrently. For example, the indexing node 404 can concurrently convert buckets and store them in common storage 216, or concurrently receive data from a data source and process data from the data source, etc.

4.2.2. Moving Buckets to Common Storage

FIG. 10 is a flow diagram illustrative of an embodiment of a routine 1000 implemented by the indexing node 404 to store data in common storage 216. Although described as being implemented by the indexing node 404, it will be understood that the elements outlined for routine 1000 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the indexing manager 402, the indexing node manager 406, the partition manager 408, the indexer 410, the bucket manager 414, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 1002, the indexing node 404 receives data. As described herein, the indexing node 404 can receive data from a variety of sources in various formats. For example, as described herein, the data received can be machine data, performance metrics, correlated data, etc.

Further, as described herein, the indexing node 404 can receive data from one or more components of the intake system 210 (e.g., the ingestion buffer 310, forwarder 302, etc.) or other data sources 202. In some embodiments, the indexing node 404 can receive data from a shard or partition of the ingestion buffer 310. Further, in certain cases, the indexing node 404 can generate a partition manager 408 for each shard or partition of a data stream. In some cases, the indexing node 404 receives data from the ingestion buffer 310 that references or points to data stored in one or more data stores, such as a data store 218 of common storage 216, or other network accessible data store or cloud storage. In such embodiments, the indexing node 404 can obtain the data from the referenced data store using the information received from the ingestion buffer 310.

At block 1004, the indexing node 404 stores data in buckets. In some embodiments, the indexing node 404 processes the received data (or the data obtained using the received data) and stores it in buckets. As part of the processing, the indexing node 404 can determine information about the data (e.g., host, source, sourcetype), extract or identify timestamps, associate metadata fields with the data, extract keywords, transform the data, identify and organize the data into events having raw machine data associated with a timestamp, etc. In some embodiments, the indexing node 404 uses one or more configuration files and/or extraction rules to extract information from the data or events.

In addition, as part of processing and storing the data, the indexing node 404 can generate buckets for the data according to a bucket creation policy. As described herein, the indexing node 404 can concurrently generate and fill multiple buckets with the data that it processes. In some embodiments, the indexing node 404 generates buckets for each partition or tenant associated with the data that is being processed. In certain embodiments, the indexing node 404 stores the data or events in the buckets based on the identified timestamps.
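A minimal sketch of one such bucket creation policy follows, assuming that buckets are keyed by partition and by a fixed time span. The event dictionary layout, the assign_to_buckets function, and the one-hour span are hypothetical choices; the disclosure states only that events can be stored in buckets based on partition or tenant and on the identified timestamps.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_to_buckets(events: List[dict],
                      span_seconds: int = 3600) -> Dict[Tuple[str, int], List[dict]]:
    """Place each event in the bucket for its partition and time span."""
    buckets: Dict[Tuple[str, int], List[dict]] = defaultdict(list)
    for event in events:
        # Round the timestamp down to the start of its span to find the bucket.
        span_start = event["timestamp"] - event["timestamp"] % span_seconds
        buckets[(event["partition"], span_start)].append(event)
    return dict(buckets)


events = [
    {"partition": "main", "timestamp": 10, "raw": "login ok"},
    {"partition": "main", "timestamp": 3599, "raw": "logout"},
    {"partition": "main", "timestamp": 3600, "raw": "login failed"},
    {"partition": "web", "timestamp": 10, "raw": "GET /index.html"},
]
buckets = assign_to_buckets(events)
```

Several buckets are open at once in this sketch, one per (partition, time span) pair, which mirrors the concurrent filling of multiple buckets described above.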

Furthermore, the indexing node 404 can generate one or more indexes associated with the buckets, such as, but not limited to, one or more inverted indexes, TSIDXs, keyword indexes, bloom filter files, etc. The data and the indexes can be stored in one or more files of the buckets. In addition, the indexing node 404 can generate additional files for the buckets, such as, but not limited to, one or more filter files, a bucket summary, or manifest, etc.

At block 1006, the indexing node 404 monitors the buckets. As described herein, the indexing node 404 can process significant amounts of data across a multitude of buckets, and can monitor the size or amount of data stored in individual buckets, groups of buckets or all the buckets that it is generating and filling. In certain embodiments, one component of the indexing node 404 can monitor the buckets (e.g., partition manager 408), while another component fills the buckets (e.g., indexer 410).

In some embodiments, as part of monitoring the buckets, the indexing node 404 can compare the individual size of the buckets or the collective size of multiple buckets with a threshold size. Once the threshold size is satisfied, the indexing node 404 can determine that the buckets are to be stored in common storage 216. In certain embodiments, the indexing node 404 can monitor the amount of time that has passed since the buckets have been stored in common storage 216. Based on a determination that a threshold amount of time has passed, the indexing node 404 can determine that the buckets are to be stored in common storage 216. Further, it will be understood that the indexing node 404 can use a bucket roll-over policy and/or a variety of techniques to determine when to store buckets in common storage 216.

At block 1008, the indexing node 404 converts the buckets. In some cases, as part of preparing the buckets for storage in common storage 216, the indexing node 404 can convert the buckets from editable buckets to non-editable buckets. In some cases, the indexing node 404 converts hot buckets to warm buckets based on the bucket roll-over policy. The bucket roll-over policy can indicate that buckets are to be converted from hot to warm buckets based on a predetermined period of time, one or more buckets satisfying a threshold size, the number of hot buckets, etc. In some cases, based on the bucket roll-over policy, the indexing node 404 converts hot buckets to warm buckets based on a collective size of multiple hot buckets satisfying a threshold size. The multiple hot buckets can correspond to any one or any combination of randomly selected hot buckets, hot buckets associated with a particular partition or shard (or partition manager 408), hot buckets associated with a particular tenant or partition, all hot buckets in the data store 412 or being processed by the indexer 410, etc.

At block 1010, the indexing node 404 stores the converted buckets in a data store. As described herein, the indexing node 404 can store the buckets in common storage 216 or other location accessible to the query system 214. In some cases, the indexing node 404 stores a copy of the buckets in common storage 216 and retains the original bucket in its data store 412. In certain embodiments, the indexing node 404 stores a copy of the buckets in common storage 216 and deletes any reference to the original buckets in its data store 412.

Furthermore, as described herein, in some cases, the indexing node 404can store the one or more buckets based on the bucket roll-over policy.In addition to indicating when buckets are to be converted from hotbuckets to warm buckets, the bucket roll-over policy can indicate whenbuckets are to be stored in common storage 216. In some cases, thebucket roll-over policy can use the same or different policies orthresholds to indicate when hot buckets are to be converted to warm andwhen buckets are to be stored in common storage 216.

In certain embodiments, the bucket roll-over policy can indicate that buckets are to be stored in common storage 216 based on a collective size of buckets satisfying a threshold size. As mentioned, the threshold size used to determine that the buckets are to be stored in common storage 216 can be the same as or different from the threshold size used to determine that editable buckets should be converted to non-editable buckets. Accordingly, in certain embodiments, based on a determination that the size of the one or more buckets has satisfied a threshold size, the indexing node 404 can convert the buckets to non-editable buckets and store the buckets in common storage 216.

Other thresholds and/or other factors or combinations of thresholds and factors can be used as part of the bucket roll-over policy. For example, the bucket roll-over policy can indicate that buckets are to be stored in common storage 216 based on the passage of a threshold amount of time. As yet another example, the bucket roll-over policy can indicate that buckets are to be stored in common storage 216 based on the number of buckets satisfying a threshold number.

It will be understood that the bucket roll-over policy can use a varietyof techniques or thresholds to indicate when to store the buckets incommon storage 216. For example, in some cases, the bucket roll-overpolicy can use any one or any combination of a threshold time period,threshold number of buckets, user information, tenant or partitioninformation, query frequency, amount of data being received, time of dayor schedules, etc., to indicate when buckets are to be stored in commonstorage 216 (and/or converted to non-editable buckets). In some cases,the bucket roll-over policy can use different priorities to determinehow to store the buckets, such as, but not limited to, minimizing orreducing time between processing and storage to common storage 216,maximizing or increasing individual bucket size, etc. Furthermore, thebucket roll-over policy can use dynamic thresholds to indicate whenbuckets are to be stored in common storage 216.
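The threshold checks described above can be sketched as follows. This is a minimal illustration rather than the described system's implementation: the class names `Bucket` and `RollOverPolicy`, the field names, and all default threshold values are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    size_bytes: int
    created_at: float  # seconds since the epoch


@dataclass
class RollOverPolicy:
    # Hypothetical defaults; the text does not specify concrete values.
    max_collective_bytes: int = 750 * 1024 * 1024
    max_age_seconds: float = 60.0
    max_bucket_count: int = 10

    def should_store(self, buckets: List[Bucket], now: float) -> bool:
        """Return True when any one of the configured thresholds is satisfied."""
        if not buckets:
            return False
        total_bytes = sum(b.size_bytes for b in buckets)
        oldest = min(b.created_at for b in buckets)
        return (
            total_bytes >= self.max_collective_bytes   # collective size threshold
            or now - oldest >= self.max_age_seconds    # threshold amount of time
            or len(buckets) >= self.max_bucket_count   # threshold number of buckets
        )
```

Here the thresholds are combined with a logical OR, so satisfying any single threshold triggers storage. A policy could equally require several thresholds at once, or weight them by priority.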

As mentioned, in some cases, based on an increased query frequency, the bucket roll-over policy can indicate that buckets are to be moved to common storage 216 more frequently by adjusting one or more thresholds used to determine when the buckets are to be stored to common storage 216 (e.g., threshold size, threshold number, threshold time, etc.).

In addition, the bucket roll-over policy can indicate that differentsets of buckets are to be rolled-over differently or at different ratesor frequencies. For example, the bucket roll-over policy can indicatethat buckets associated with a first tenant or partition are to berolled over according to one policy and buckets associated with a secondtenant or partition are to be rolled over according to a differentpolicy. The different policies may indicate that the buckets associatedwith the first tenant or partition are to be stored more frequently tocommon storage 216 than the buckets associated with the second tenant orpartition. Accordingly, the bucket roll-over policy can use one set ofthresholds (e.g., threshold size, threshold number, and/or thresholdtime, etc.) to indicate when the buckets associated with the firsttenant or partition are to be stored in common storage 216 and adifferent set of thresholds for the buckets associated with the secondtenant or partition.

As another non-limiting example, consider a scenario in which buckets from a partition _main are being queried more frequently than buckets from the partition _test. The bucket roll-over policy can indicate that based on the increased frequency of queries for buckets from partition _main, buckets associated with partition _main should be moved more frequently to common storage 216, for example, by adjusting the threshold size used to determine when to store the buckets in common storage 216. In this way, the query system 214 can obtain relevant search results more quickly for data associated with the _main partition. Further, if the frequency of queries for buckets from the _main partition decreases, the data intake and query system 108 can adjust the threshold accordingly. In addition, the bucket roll-over policy may indicate that the changes are only for buckets associated with the partition _main or that the changes are to be made for all buckets, or all buckets associated with a particular tenant that is associated with the partition _main, etc.
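One way such a query-frequency adjustment could be expressed is sketched below. The function name, the inverse-proportional scaling rule, the baseline rate, and the floor value are all assumptions chosen for illustration; the text does not prescribe a particular formula.

```python
def adjusted_size_threshold(
    base_bytes: int,
    queries_per_minute: float,
    baseline_qpm: float = 10.0,        # hypothetical "normal" query rate
    floor_bytes: int = 1024 * 1024,    # hypothetical lower bound on the threshold
) -> int:
    """Shrink the size threshold as the query rate rises above the baseline.

    A smaller threshold means buckets for the partition reach it sooner and
    are therefore moved to common storage more frequently.
    """
    if queries_per_minute <= baseline_qpm:
        return base_bytes
    scaled = int(base_bytes * baseline_qpm / queries_per_minute)
    return max(scaled, floor_bytes)
```

For instance, a partition such as _main queried at twice the baseline rate would get half the base threshold, while a partition such as _test queried at or below the baseline would keep the base threshold. When the query rate falls back, the same function returns the original threshold.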

Furthermore, as mentioned, the bucket roll-over policy can indicate that buckets are to be stored in common storage 216 at different rates or frequencies based on time of day. For example, the data intake and query system 108 can adjust the thresholds so that the buckets are moved to common storage 216 more frequently during working hours and less frequently during non-working hours. In this way, the delay between processing and making the data available for searching during working hours can be reduced, and the amount of merging performed on buckets generated during non-working hours can be decreased. In other cases, the data intake and query system 108 can adjust the thresholds so that the buckets are moved to common storage 216 less frequently during working hours and more frequently during non-working hours.

As mentioned, the bucket roll-over policy can indicate that based on anincreased rate at which data is received, buckets are to be moved tocommon storage more (or less) frequently. For example, if the bucketroll-over policy initially indicates that the buckets are to be storedevery millisecond, as the rate of data received by the indexing node 404increases, the amount of data received during each millisecond canincrease, resulting in more data waiting to be stored. As such, in somecases, the bucket roll-over policy can indicate that the buckets are tobe stored more frequently in common storage 216. Further, in some cases,such as when a collective bucket size threshold is used, an increasedrate at which data is received may overburden the indexing node 404 dueto the overhead associated with copying each bucket to common storage216. As such, in certain cases, the bucket roll-over policy can use alarger collective bucket size threshold to indicate that the buckets areto be stored in common storage 216. In this way, the bucket roll-overpolicy can reduce the ratio of overhead to data being stored.

Similarly, the bucket roll-over policy can indicate that certain usersare to be treated differently. For example, if a particular user islogged in, the bucket roll-over policy can indicate that the buckets inan indexing node 404 are to be moved to common storage 216 more or lessfrequently to accommodate the user's preferences, etc. Further, asmentioned, in some embodiments, the data intake and query system 108 mayindicate that only those buckets associated with the user (e.g., basedon tenant information, indexing information, user information, etc.) areto be stored more or less frequently.

Furthermore, the bucket roll-over policy can indicate whether, aftercopying buckets to common storage 216, the locally stored buckets are tobe retained or discarded. In some cases, the bucket roll-over policy canindicate that the buckets are to be retained for merging. In certaincases, the bucket roll-over policy can indicate that the buckets are tobe discarded.

Fewer, more, or different blocks can be used as part of the routine1000. In some cases, one or more blocks can be omitted. For example, incertain embodiments, the indexing node 404 may not convert the bucketsbefore storing them. As another example, the routine 1000 can includenotifying the data source, such as the intake system, that the bucketshave been uploaded to common storage, merging buckets and uploadingmerged buckets to common storage, receiving identifying informationabout the buckets in common storage 216 and updating a data storecatalog 220 with the received information, etc.

Furthermore, it will be understood that the various blocks describedherein with reference to FIG. 10 can be implemented in a variety oforders, or can be performed concurrently. For example, the indexing node404 can concurrently convert buckets and store them in common storage216, or concurrently receive data from a data source and process datafrom the data source, etc.

4.2.3. Updating Location Marker in Ingestion Buffer

FIG. 11 is a flow diagram illustrative of an embodiment of a routine1100 implemented by the indexing node 404 to update a location marker inan ingestion buffer, e.g., ingestion buffer 310. Although described asbeing implemented by the indexing node 404, it will be understood thatthe elements outlined for routine 1100 can be implemented by one or morecomputing devices/components that are associated with the data intakeand query system 108, such as, but not limited to, the indexing manager402, the indexing node manager 406, the partition manager 408, theindexer 410, the bucket manager 414, etc. Thus, the followingillustrative embodiment should not be construed as limiting. Moreover,although the example refers to updating a location marker in ingestionbuffer 310, other implementations can include other ingestion componentswith other types of location tracking that can be updated in a similarmanner as the location marker.

At block 1102, the indexing node 404 receives data. As described ingreater detail above with reference to block 1002, the indexing node 404can receive a variety of types of data from a variety of sources.

In some embodiments, the indexing node 404 receives data from aningestion buffer 310. As described herein, the ingestion buffer 310 canoperate according to a pub-sub messaging service. As such, the ingestionbuffer 310 can communicate data to the indexing node 404, and alsoensure that the data is available for additional reads until it receivesan acknowledgement from the indexing node 404 that the data can beremoved.

In some cases, the ingestion buffer 310 can use one or more readpointers or location markers to track the data that has beencommunicated to the indexing node 404 but that has not been acknowledgedfor removal. As the ingestion buffer 310 receives acknowledgments fromthe indexing node 404, it can update the location markers. In somecases, such as where the ingestion buffer 310 uses multiple partitionsor shards to provide the data to the indexing node 404, the ingestionbuffer 310 can include at least one location marker for each partitionor shard. In this way, the ingestion buffer 310 can separately track theprogress of the data reads in the different shards.

In certain embodiments, the indexing node 404 can receive (and/or store)the location markers in addition to or as part of the data received fromthe ingestion buffer 310. Accordingly, the indexing node 404 can trackthe location of the data in the ingestion buffer 310 that the indexingnode 404 has received from the ingestion buffer 310. In this way, if anindexer 410 or partition manager 408 becomes unavailable or fails, theindexing node 404 can assign a different indexer 410 or partitionmanager 408 to process or manage the data from the ingestion buffer 310and provide the indexer 410 or partition manager 408 with a locationfrom which the indexer 410 or partition manager 408 can obtain the data.

At block 1104, the indexing node 404 stores the data in buckets. As described in greater detail above with reference to block 1004 of FIG. 10, as part of storing the data in buckets, the indexing node 404 can parse the data, generate events, generate indexes of the data, compress the data, etc. In some cases, the indexing node 404 can store the data in hot or warm buckets and/or convert hot buckets to warm buckets based on the bucket roll-over policy.

At block 1106, the indexing node 404 stores buckets in common storage216. As described herein, in certain embodiments, the indexing node 404stores the buckets in common storage 216 according to the bucketroll-over policy. In some cases, the buckets are stored in commonstorage 216 in one or more directories based on an index/partition ortenant associated with the buckets. Further, the buckets can be storedin a time series manner to facilitate time series searching as describedherein. Additionally, as described herein, the common storage 216 canreplicate the buckets across multiple tiers and data stores across oneor more geographical locations. In some cases, in response to thestorage, the indexing node 404 receives an acknowledgement that the datawas stored. Further, the indexing node 404 can receive information aboutthe location of the data in common storage, one or more identifiers ofthe stored data, etc. The indexing node 404 can use this information toupdate the data store catalog 220.

At block 1108, the indexing node 404 notifies an ingestion buffer 310that the data has been stored in common storage 216. As describedherein, in some cases, the ingestion buffer 310 can retain locationmarkers for the data that it sends to the indexing node 404. Theingestion buffer 310 can use the location markers to indicate that thedata sent to the indexing node 404 is to be made persistently availableto the indexing system 212 until the ingestion buffer 310 receives anacknowledgement from the indexing node 404 that the data has been storedsuccessfully. In response to the acknowledgement, the ingestion buffer310 can update the location marker(s) and communicate the updatedlocation markers to the indexing node 404. The indexing node 404 canstore updated location markers for use in the event one or morecomponents of the indexing node 404 (e.g., partition manager 408,indexer 410) become unavailable or fail. In this way, the ingestionbuffer 310 and the location markers can aid in providing a statelessindexing service.
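The interplay between retained data, acknowledgements, and per-shard location markers can be sketched with a toy in-memory buffer. The class `IngestionBuffer` and its methods are hypothetical names for this example only and do not correspond to any particular pub-sub service's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class IngestionBuffer:
    """Toy per-shard buffer that retains messages until they are acknowledged."""

    def __init__(self) -> None:
        self._messages: Dict[str, List[str]] = defaultdict(list)
        # Location marker per shard: offset of the first unacknowledged message.
        self._marker: Dict[str, int] = defaultdict(int)

    def publish(self, shard: str, message: str) -> int:
        """Append a message and return its offset within the shard."""
        self._messages[shard].append(message)
        return self._marker[shard] + len(self._messages[shard]) - 1

    def read_from(self, shard: str, offset: int) -> List[Tuple[int, str]]:
        """Return (offset, message) pairs still retained at or after offset."""
        base = self._marker[shard]
        start = max(offset - base, 0)
        retained = self._messages[shard]
        return [(base + i, m) for i, m in enumerate(retained[start:], start)]

    def acknowledge(self, shard: str, stored_through: int) -> int:
        """Drop messages up to stored_through (inclusive); return the new marker."""
        drop = stored_through + 1 - self._marker[shard]
        drop = max(0, min(drop, len(self._messages[shard])))
        del self._messages[shard][:drop]
        self._marker[shard] += drop
        return self._marker[shard]
```

Because unacknowledged messages remain readable, a replacement indexer or partition manager that is handed the last stored marker can resume with `read_from(shard, marker)` without the failed component having kept any state of its own.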

Fewer, more, or different blocks can be used as part of the routine 1100. In some cases, one or more blocks can be omitted. For example, in certain embodiments, the indexing node 404 can update the data store catalog 220 with information about the buckets created by the indexing node 404 and/or stored in common storage 216, as described herein.

Furthermore, it will be understood that the various blocks describedherein with reference to FIG. 11 can be implemented in a variety oforders. In some cases, the indexing node 404 can implement some blocksconcurrently or change the order as desired. For example, the indexingnode 404 can concurrently receive data, store other data in buckets, andstore buckets in common storage.

4.2.4. Merging Buckets

FIG. 12 is a flow diagram illustrative of an embodiment of a routine1200 implemented by the indexing node 404 to merge buckets. Althoughdescribed as being implemented by the indexing node 404, it will beunderstood that the elements outlined for routine 1200 can beimplemented by one or more computing devices/components that areassociated with the data intake and query system 108, such as, but notlimited to, the indexing manager 402, the indexing node manager 406, thepartition manager 408, the indexer 410, the bucket manager 414, etc.Thus, the following illustrative embodiment should not be construed aslimiting.

At block 1202, the indexing node 404 stores data in buckets. As described herein, the indexing node 404 can process various types of data from a variety of sources. Further, the indexing node 404 can create one or more buckets according to a bucket creation policy and store the data in one or more buckets. In addition, in certain embodiments, the indexing node 404 can convert hot or editable buckets to warm or non-editable buckets according to a bucket roll-over policy.

At block 1204, the indexing node 404 stores buckets in common storage216. As described herein, the indexing node 404 can store the buckets incommon storage 216 according to the bucket roll-over policy. In somecases, the buckets are stored in common storage 216 in one or moredirectories based on an index/partition or tenant associated with thebuckets. Further, the buckets can be stored in a time series manner tofacilitate time series searching as described herein. Additionally, asdescribed herein, the common storage 216 can replicate the bucketsacross multiple tiers and data stores across one or more geographicallocations.

At block 1206, the indexing node 404 updates the data store catalog 220.As described herein, in some cases, in response to the storage, theindexing node 404 receives an acknowledgement that the data was stored.Further, the indexing node 404 can receive information about thelocation of the data in common storage, one or more identifiers of thestored data, etc. The received information can be used by the indexingnode 404 to update the data store catalog 220. In addition, the indexingnode 404 can provide the data store catalog 220 with any one or anycombination of the tenant or partition associated with the bucket, atime range of the events in the bucket, one or more metadata fields ofthe bucket (e.g., host, source, sourcetype, etc.), etc. In this way, thedata store catalog 220 can store up-to-date information about thebuckets in common storage 216. Further, this information can be used bythe query system 214 to identify relevant buckets for a query.

In some cases, the indexing node 404 can update the data store catalog 220 before, after, or concurrently with storing the data to common storage 216. For example, as buckets are created by the indexing node 404, the indexing node 404 can update the data store catalog 220 with information about the created buckets, such as, but not limited to, a partition or tenant associated with the bucket, a time range or initial time (e.g., time of earliest-in-time timestamp), etc. In addition, the indexing node 404 can include an indication that the bucket is a hot bucket or editable bucket and that the contents of the bucket are not (yet) available for searching or in the common storage 216.

As the bucket is filled with events or data, the indexing node 404 canupdate the data store catalog 220 with additional information about thebucket (e.g., updated time range based on additional events, size of thebucket, number of events in the bucket, certain keywords or metadatafrom the bucket, such as, but not limited to a host, source, orsourcetype associated with different events in the bucket, etc.).Further, once the bucket is uploaded to common storage 216, the indexingnode 404 can complete the entry for the bucket, such as, by providing acompleted time range, location information of the bucket in commonstorage 216, completed keyword or metadata information as desired, etc.

The information in the data store catalog 220 can be used by the querysystem 214 to execute queries. In some cases, based on the informationin the data store catalog 220 about buckets that are not yet availablefor searching, the query system 214 can wait until the data is availablefor searching before completing the query or inform a user that somedata that may be relevant has not been processed or that the resultswill be updated. Further, in some cases, the query system 214 can informthe indexing system 212 about the bucket, and the indexing system 212can cause the indexing node 404 to store the bucket in common storage216 sooner than it otherwise would without the communication from thequery system 214.

In addition, the indexing node 404 can update the data store catalog 220with information about buckets to be merged. For example, once one ormore buckets are identified for merging, the indexing node 404 canupdate an entry for the buckets in the data store catalog 220 indicatingthat they are part of a merge operation and/or will be replaced. In somecases, as part of the identification, the data store catalog 220 canprovide information about the entries to the indexing node 404 formerging. As the entries may have summary information about the buckets,the indexing node 404 can use the summary information to generate amerged entry for the data store catalog 220 as opposed to generating thesummary information from the merged data itself. In this way, theinformation from the data store catalog 220 can increase the efficiencyof a merge operation by the indexing node 404.

At block 1208, the indexing node 404 merges buckets. In some embodiments, the indexing node 404 can merge buckets according to a bucket merge policy. As described herein, the bucket merge policy can indicate which buckets to merge, when to merge buckets and one or more parameters for the merged buckets (e.g., time range for the merged buckets, size of the merged buckets, etc.). For example, the bucket merge policy can indicate that only buckets associated with the same tenant identifier and/or partition can be merged. As another example, the bucket merge policy can indicate that only buckets that satisfy a threshold age (e.g., have existed or been converted to warm buckets for more than a set period of time) are eligible for a merge. Similarly, the bucket merge policy can indicate that each merged bucket must be at least 750 MB or no greater than 1 GB, or cannot have a time range that exceeds a predetermined amount or is larger than 75% of other buckets. The other buckets can refer to one or more buckets in common storage 216 or similar buckets (e.g., buckets associated with the same tenant, partition, host, source, or sourcetype, etc.). In certain cases, the bucket merge policy can indicate that buckets are to be merged based on a schedule (e.g., during non-working hours) or user login (e.g., when a particular user is not logged in), etc. In certain embodiments, the bucket merge policy can indicate that bucket merges can be adjusted dynamically. For example, based on the rate of incoming data or queries, the bucket merge policy can indicate that buckets are to be merged more or less frequently, etc. In some cases, the bucket merge policy can indicate that, due to increased processing demands by other indexing nodes 404 or other components of an indexing node 404, such as processing and storing buckets, bucket merges are to occur less frequently so that the computing resources used to merge buckets can be redirected to other tasks. It will be understood that a variety of priorities and policies can be used as part of the bucket merge policy.
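A few of the merge-policy rules above (same tenant and partition, a minimum age, and a maximum merged size) can be sketched as a simple planning function. The names `WarmBucket` and `plan_merges`, the greedy grouping strategy, and the default age value are assumptions made for this example; only the 1 GB upper bound is taken from the text.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class WarmBucket:
    tenant: str
    partition: str
    size_bytes: int
    warm_since: float  # time the bucket became non-editable


def plan_merges(
    buckets: List[WarmBucket],
    now: float,
    max_bytes: int = 1024 ** 3,   # "no greater than 1 GB" from the text
    min_age: float = 300.0,       # hypothetical threshold age in seconds
) -> List[List[WarmBucket]]:
    """Greedily group eligible buckets that share a tenant and partition."""
    eligible = [b for b in buckets if now - b.warm_since >= min_age]
    key = lambda b: (b.tenant, b.partition)
    plans: List[List[WarmBucket]] = []
    for _, group in groupby(sorted(eligible, key=key), key=key):
        current: List[WarmBucket] = []
        total = 0
        for bucket in group:
            # Close the current group if adding this bucket would exceed the cap.
            if current and total + bucket.size_bytes > max_bytes:
                if len(current) > 1:
                    plans.append(current)
                current, total = [], 0
            current.append(bucket)
            total += bucket.size_bytes
        if len(current) > 1:  # a single bucket is not a merge
            plans.append(current)
    return plans
```

This sketch only enforces the upper size bound. A fuller policy would also check the minimum merged size and the time-range limits, and could defer planning when the node is busy processing and storing buckets.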

At block 1210, the indexing node 404 stores the merged buckets in commonstorage 216. In certain embodiments, the indexing node 404 can store themerged buckets based on the bucket merge policy. For example, based onthe bucket merge policy indicating that merged buckets are to satisfy asize threshold, the indexing node 404 can store a merged bucket once itsatisfies the size threshold. Similarly, the indexing node 404 can storethe merged buckets after a predetermined amount of time or duringnon-working hours, etc., per the bucket merge policy.

In response to the storage of the merged buckets in common storage 216,the indexing node 404 can receive an acknowledgement that the mergedbuckets have been stored. In some cases, the acknowledgement can includeinformation about the merged buckets, including, but not limited to, astorage location in common storage 216, identifier, etc.

At block 1212, the indexing node 404 updates the data store catalog 220. As described herein, the indexing node 404 can store information about the merged buckets in the data store catalog 220. The information can be similar to the information stored in the data store catalog 220 for the pre-merged buckets (buckets used to create the merged buckets). For example, in some cases, the indexing node 404 can store any one or any combination of the following in the data store catalog: the tenant or partition associated with the merged buckets, a time range of the merged bucket, the location information of the merged bucket in common storage 216, metadata fields associated with the bucket (e.g., host, source, sourcetype), etc. As mentioned, the information about the merged buckets in the data store catalog 220 can be used by the query system 214 to identify relevant buckets for a search. Accordingly, in some embodiments, the data store catalog 220 can be used in a similar fashion as an inverted index, and can include similar information (e.g., time ranges, field-value pairs, keyword pairs, location information, etc.). However, instead of providing information about individual events in a bucket, the data store catalog 220 can provide information about individual buckets in common storage 216.

In some cases, the indexing node 404 can retrieve information from thedata store catalog 220 about the pre-merged buckets and use thatinformation to generate information about the merged bucket(s) forstorage in the data store catalog 220. For example, the indexing node404 can use the time ranges of the pre-merged buckets to generate amerged time range, identify metadata fields associated with thedifferent events in the pre-merged buckets, etc. In certain embodiments,the indexing node 404 can generate the information about the mergedbuckets for the data store catalog 220 from the merged data itselfwithout retrieving information about the pre-merged buckets from thedata store catalog 220.
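Deriving a merged catalog entry from the pre-merged entries, rather than from the merged data, can be sketched as below. The `CatalogEntry` fields and the `merged_entry` function are hypothetical and chosen only to mirror the kinds of information the text mentions (tenant, partition, time range, metadata fields).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class CatalogEntry:
    bucket_id: str
    tenant: str
    partition: str
    earliest: int  # earliest event timestamp in the bucket
    latest: int    # latest event timestamp in the bucket
    sourcetypes: FrozenSet[str] = frozenset()


def merged_entry(merged_id: str, entries: Iterable[CatalogEntry]) -> CatalogEntry:
    """Build the merged bucket's entry from summary data already in the catalog."""
    entries = list(entries)
    if not entries:
        raise ValueError("at least one pre-merged entry is required")
    if len({(e.tenant, e.partition) for e in entries}) != 1:
        raise ValueError("pre-merged buckets must share a tenant and partition")
    first = entries[0]
    return CatalogEntry(
        bucket_id=merged_id,
        tenant=first.tenant,
        partition=first.partition,
        earliest=min(e.earliest for e in entries),  # merged time range start
        latest=max(e.latest for e in entries),      # merged time range end
        sourcetypes=frozenset().union(*(e.sourcetypes for e in entries)),
    )
```

The merged time range is the span covering all pre-merged ranges, and the metadata fields are the union of the pre-merged fields, so no pass over the merged events is needed.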

In certain embodiments, as part of updating the data store catalog 220with information about the merged buckets, the indexing node 404 candelete the information in the data store catalog 220 about thepre-merged buckets. For example, once the merged bucket is stored incommon storage 216, the merged bucket can be used for queries. As such,the information about the pre-merged buckets can be removed so that thequery system 214 does not use the pre-merged buckets to execute a query.

Fewer, more, or different blocks can be used as part of the routine1200. In some cases, one or more blocks can be omitted. For example, incertain embodiments, the indexing node 404 can delete locally storedbuckets. In some cases, the indexing node 404 deletes any buckets usedto form merged buckets and/or the merged buckets. In this way, theindexing node 404 can reduce the amount of data stored in the indexingnode 404.

In certain embodiments, the indexing node 404 can instruct the commonstorage 216 to delete buckets or delete the buckets in common storageaccording to a bucket management policy. For example, the indexing node404 can instruct the common storage 216 to delete any buckets used togenerate the merged buckets. Based on the bucket management policy, thecommon storage 216 can remove the buckets. As described herein, thebucket management policy can indicate when buckets are to be removedfrom common storage 216. For example, the bucket management policy canindicate that buckets are to be removed from common storage 216 after apredetermined amount of time, once any queries relying on the pre-mergedbuckets are completed, etc.

By removing buckets from common storage 216, the indexing node 404 can reduce the size or amount of data stored in common storage 216 and improve search times. For example, in some cases, large buckets can decrease search times as there are fewer buckets for the query system 214 to search. As another example, merging buckets after indexing allows optimal or near-optimal bucket sizes for search (e.g., performed by query system 214) and index (e.g., performed by indexing system 212) to be determined independently or near-independently.

Furthermore, it will be understood that the various blocks described herein with reference to FIG. 12 can be implemented in a variety of orders. In some cases, the indexing node 404 can implement some blocks concurrently or change the order as desired. For example, the indexing node 404 can concurrently merge buckets while updating an ingestion buffer 310 about the data stored in common storage 216 or updating the data store catalog 220. As another example, the indexing node 404 can delete data about the pre-merged buckets locally and instruct the common storage 216 to delete the data about the pre-merged buckets while concurrently updating the data store catalog 220 about the merged buckets. In some embodiments, the indexing node 404 deletes the pre-merged bucket data entries in the data store catalog 220 prior to instructing the common storage 216 to delete the buckets. In this way, the indexing node 404 can reduce the risk that a query relies on information in the data store catalog 220 that does not reflect the data stored in the common storage 216.
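The catalog-first deletion order described in some embodiments can be sketched as follows, using plain dictionaries as stand-ins for the data store catalog and common storage. The function name and the returned action log are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def retire_pre_merged(
    bucket_ids: List[str],
    catalog: Dict[str, dict],
    storage: Dict[str, bytes],
) -> List[Tuple[str, str]]:
    """Remove catalog entries first, then the stored buckets.

    Returns the ordered list of (target, bucket_id) actions taken.
    """
    actions: List[Tuple[str, str]] = []
    # Step 1: drop catalog entries so no new query can discover these buckets.
    for bucket_id in bucket_ids:
        if bucket_id in catalog:
            del catalog[bucket_id]
            actions.append(("catalog", bucket_id))
    # Step 2: only then delete the bucket data itself.
    for bucket_id in bucket_ids:
        if bucket_id in storage:
            del storage[bucket_id]
            actions.append(("storage", bucket_id))
    return actions
```

With this ordering, the catalog never lists a bucket whose data has already been removed. Queries that obtained the identifiers before step 1 are a separate concern, which the bucket management policy can address by delaying removal until those queries complete.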

4.3. Querying

FIG. 13 is a data flow diagram illustrating an embodiment of the data flow and communications between a variety of the components of the data intake and query system 108 during execution of a query. Specifically, FIG. 13 is a data flow diagram illustrating an embodiment of the data flow and communications between the indexing system 212, the data store catalog 220, a search head 504, a search node monitor 508, search node catalog 510, search nodes 506, common storage 216, and the query acceleration data store 222. However, it will be understood that, in some embodiments, one or more of the functions described herein with respect to FIG. 13 can be omitted, performed in a different order and/or performed by a different component of the data intake and query system 108. Accordingly, the illustrated embodiment and description should not be construed as limiting.

Further, it will be understood that the various functions describedherein with respect to FIG. 13 can be performed by one or more distinctcomponents of the data intake and query system 108. For example, forsimplicity, reference is made to a search head 504 performing one ormore functions. However, it will be understood that these functions canbe performed by one or more components of the search head 504, such as,but not limited to, the search master 512 and/or the search manager 514.Similarly, reference is made to the indexing system 212 performing oneor more functions. However, it will be understood that the functionsidentified as being performed by the indexing system 212 can beperformed by one or more components of the indexing system 212.

At (1) and (2), the indexing system 212 monitors the storage of processed data and updates the data store catalog 220 based on the monitoring. As described herein, one or more components of the indexing system 212, such as the partition manager 408 and/or the indexer 410, can monitor the storage of data or buckets to common storage 216. As the data is stored in common storage 216, the indexing system 212 can obtain information about the data stored in the common storage 216, such as, but not limited to, location information, bucket identifiers, tenant identifier (e.g., for buckets that are single tenant), etc. The indexing system 212 can use the received information about the data stored in common storage 216 to update the data store catalog 220.

Furthermore, as described herein, in some embodiments, the indexing system 212 can merge buckets into one or more merged buckets, store the merged buckets in common storage 216, and update the data store catalog 220 with the information about the merged buckets stored in common storage 216.

At (3) and (4), the search node monitor 508 monitors the search nodes 506 and updates the search node catalog 510. As described herein, the search node monitor 508 can monitor the availability, responsiveness, and/or utilization rate of the search nodes 506. Based on the status of the search nodes 506, the search node monitor 508 can update the search node catalog 510. In this way, the search node catalog 510 can retain information regarding a current status of each of the search nodes 506 in the query system 214.

At (5), the search head 504 receives a query and generates a search manager 514. As described herein, in some cases, a search master 512 can generate the search manager 514. For example, the search master 512 can spin up or instantiate a new process, container, or virtual machine, or copy itself to generate the search manager 514, etc. As described herein, in some embodiments, the search manager 514 can perform one or more of the functions described herein with reference to FIG. 13 as being performed by the search head 504 to process and execute the query.

The search head 504 (6A) requests data identifiers from the data store catalog 220 and (6B) requests an identification of available search nodes from the search node catalog 510. As described, the data store catalog 220 can include information regarding the data stored in common storage 216 and the search node catalog 510 can include information regarding the search nodes 506 of the query system 214. Accordingly, the search head 504 can query the respective catalogs to identify data or buckets that include data that satisfies at least a portion of the query and search nodes available to execute the query. In some cases, these requests can be done concurrently or in any order.

At (7A), the data store catalog 220 provides the search head 504 with an identification of data that satisfies at least a portion of the query. As described herein, in response to the request from the search head 504, the data store catalog 220 can be used to identify and return identifiers of buckets in common storage 216 and/or location information of data in common storage 216 that satisfy at least a portion of the query or at least some filter criteria (e.g., buckets associated with an identified tenant or partition or that satisfy an identified time range, etc.).

In some cases, as the data store catalog 220 can routinely receive updates from the indexing system 212, it can implement a read-write lock while it is being queried by the search head 504. Furthermore, the data store catalog 220 can store information regarding which buckets were identified for the search. In this way, the data store catalog 220 can be used by the indexing system 212 to determine which buckets in common storage 216 can be removed or deleted as part of a merge operation.

At (7B), the search node catalog 510 provides the search head 504 with an identification of available search nodes 506. As described herein, in response to the request from the search head 504, the search node catalog 510 can be used to identify and return identifiers for search nodes 506 that are available to execute the query.

At (8), the search head 504 maps the identified search nodes 506 to the data according to a search node mapping policy. In some cases, per the search node mapping policy, the search head 504 can dynamically map search nodes 506 to the identified data or buckets. As described herein, the search head 504 can map the identified search nodes 506 to the identified data or buckets at one time or iteratively as the buckets are searched according to the search node mapping policy. In certain embodiments, per the search node mapping policy, the search head 504 can map the identified search nodes 506 to the identified data based on previous assignments, data stored in a local or shared data store of one or more search nodes 506, network architecture of the search nodes 506, a hashing algorithm, etc.

In some cases, as some of the data may reside in a local or shared data store between the search nodes 506, the search head 504 can attempt to map data that was previously assigned to a search node 506 to the same search node 506. In certain embodiments, to map the data to the search nodes 506, the search head 504 uses the identifiers, such as bucket identifiers, received from the data store catalog 220. In some embodiments, the search head 504 performs a hash function to map a bucket identifier to a search node 506. In some cases, the search head 504 uses a consistent hash algorithm to increase the probability of mapping a bucket identifier to the same search node 506.
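By way of non-limiting illustration, the consistent hash mapping of bucket identifiers to search nodes 506 can be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the class name ConsistentHashRing, the use of SHA-256, the number of ring replicas, and the node and bucket identifier formats are all hypothetical and are not specified by this description.

```python
import bisect
import hashlib


def _stable_hash(key: str) -> int:
    """Process-independent 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical hash ring that maps bucket identifiers to search node identifiers."""

    def __init__(self, node_ids, replicas=64):
        # Each node is placed on the ring several times to even out the load.
        # node_ids is assumed to be non-empty.
        self._ring = sorted(
            (_stable_hash(f"{node}#{i}"), node)
            for node in node_ids
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, bucket_id: str) -> str:
        # Walk clockwise to the first ring point after the bucket's hash.
        idx = bisect.bisect_right(self._points, _stable_hash(bucket_id))
        return self._ring[idx % len(self._ring)][1]
```

In such a sketch, removing one search node from the ring changes only the assignments of the buckets that were mapped to the removed node; buckets mapped to the remaining nodes keep their assignment, which is the property that makes a locally cached bucket more likely to be reused.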

In certain embodiments, the search head 504 or query system 214 can maintain a table or list of bucket mappings to search nodes 506. In such embodiments, per the search node mapping policy, the search head 504 can use the mapping to identify previous assignments between search nodes and buckets. If a particular bucket identifier has not been assigned to a search node 506, the search head 504 can use a hash algorithm to assign it to a search node 506. In certain embodiments, prior to using the mapping for a particular bucket, the search head 504 can confirm that the search node 506 that was previously assigned to the particular bucket is available for the query. In some embodiments, if the search node 506 is not available for the query, the search head 504 can determine whether another search node 506 that shares a data store with the unavailable search node 506 is available for the query. If the search head 504 determines that an available search node 506 shares a data store with the unavailable search node 506, the search head 504 can assign the identified available search node 506 to the bucket identifier that was previously assigned to the now unavailable search node 506.
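As a non-limiting illustration, the lookup order described above (previous assignment, then an available search node sharing a data store, then a hash) can be sketched as follows. The function name assign_node, its parameters, and the dictionary-based tables are hypothetical. The description does not state what happens when no available search node shares a data store with the unavailable one; falling back to the hash in that case is an assumption of this sketch.

```python
import hashlib


def assign_node(bucket_id, previous, available, store_of):
    """Pick a search node for a bucket.

    previous:  dict of bucket id -> previously assigned node id
    available: non-empty set of node ids available for this query
    store_of:  dict of node id -> identifier of its (possibly shared) data store
    """
    prior = previous.get(bucket_id)
    if prior is not None:
        if prior in available:
            return prior  # reuse the earlier assignment
        shared = store_of.get(prior)
        if shared is not None:
            # Prefer an available node that shares the unavailable node's data store.
            for node in sorted(available):
                if store_of.get(node) == shared:
                    return node
    # No usable earlier assignment: fall back to a hash (assumption of this sketch).
    nodes = sorted(available)
    digest = hashlib.sha256(bucket_id.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```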

At (9), the search head 504 instructs the search nodes 506 to execute the query. As described herein, based on the assignment of buckets to the search nodes 506, the search head 504 can generate search instructions for each of the assigned search nodes 506. These instructions can be in various forms, including, but not limited to, JSON, a directed acyclic graph (DAG), etc. In some cases, the search head 504 can generate sub-queries for the search nodes 506. Each sub-query or set of instructions generated for a particular search node 506 can identify the buckets that are to be searched, the filter criteria to identify a subset of the set of data to be processed, and the manner of processing the subset of data. Accordingly, the instructions can provide the search nodes 506 with the relevant information to execute their particular portion of the query.
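For illustration only, per-node instructions in a JSON form might be generated as sketched below. The function name build_instructions and the field names "buckets", "filter", and "transformations" are hypothetical; the description states only that the instructions can identify the buckets, the filter criteria, and the manner of processing, and does not define a schema.

```python
import json


def build_instructions(assignments, filter_criteria, transformations):
    """Return a dict of node id -> JSON string describing that node's share of the query.

    assignments:     dict of node id -> iterable of bucket ids assigned to the node
    filter_criteria: JSON-serializable description of the filter (hypothetical shape)
    transformations: JSON-serializable list of processing steps (hypothetical shape)
    """
    instructions = {}
    for node_id, bucket_ids in assignments.items():
        instructions[node_id] = json.dumps(
            {
                "buckets": sorted(bucket_ids),       # which buckets to search
                "filter": filter_criteria,           # which subset of the data to keep
                "transformations": transformations,  # how to process that subset
            },
            sort_keys=True,
        )
    return instructions
```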

At (10), the search nodes 506 obtain the data to be searched. As described herein, in some cases the data to be searched can be stored on one or more local or shared data stores of the search nodes 506. In certain embodiments, the data to be searched is located in the common storage 216. In such embodiments, the search nodes 506 or a cache manager 516 can obtain the data from the common storage 216.

In some cases, the cache manager 516 can identify or obtain the data requested by the search nodes 506. For example, if the requested data is stored on the local or shared data store of the search nodes 506, the cache manager 516 can identify the location of the data for the search nodes 506. If the requested data is stored in common storage 216, the cache manager 516 can obtain the data from the common storage 216.

As described herein, in some embodiments, the cache manager 516 can obtain a subset of the files associated with the bucket to be searched by the search nodes 506. For example, based on the query, the search node 506 can determine that a subset of the files of a bucket are to be used to execute the query. Accordingly, the search node 506 can request the subset of files, as opposed to all files of the bucket. The cache manager 516 can download the subset of files from common storage 216 and provide them to the search node 506 for searching.

In some embodiments, such as when a search node 506 cannot uniquely identify the file of a bucket to be searched, the cache manager 516 can download a bucket summary or manifest that identifies the files associated with the bucket. The search node 506 can use the bucket summary or manifest to uniquely identify the file to be used in the query. The cache manager 516 can then obtain that uniquely identified file from common storage 216.

At (11), the search nodes 506 search and process the data. As described herein, the sub-queries or instructions received from the search head 504 can instruct the search nodes 506 to identify data within one or more buckets and perform one or more transformations on the data. Accordingly, each search node 506 can identify a subset of the set of data to be processed and process the subset of data according to the received instructions. This can include searching the contents of one or more inverted indexes of a bucket or the raw machine data or events of a bucket, etc. In some embodiments, based on the query or sub-query, a search node 506 can perform one or more transformations on the data received from each bucket or on aggregate data from the different buckets that are searched by the search node 506.

At (12), the search head 504 monitors the status of the query of the search nodes 506. As described herein, the search nodes 506 can become unresponsive or fail for a variety of reasons (e.g., network failure, error, high utilization rate, etc.). Accordingly, during execution of the query, the search head 504 can monitor the responsiveness and availability of the search nodes 506. In some cases, this can be done by pinging or querying the search nodes 506, establishing a persistent communication link with the search nodes 506, or receiving status updates from the search nodes 506. In some cases, the status can indicate the buckets that have been searched by the search nodes 506, the number or percentage of remaining buckets to be searched, the percentage of the query that has been executed by the search node 506, etc. In some cases, based on a determination that a search node 506 has become unresponsive, the search head 504 can assign a different search node 506 to complete the portion of the query assigned to the unresponsive search node 506.

In certain embodiments, depending on the status of the search nodes 506, the search manager 514 can dynamically assign or re-assign buckets to search nodes 506. For example, as search nodes 506 complete their search of buckets assigned to them, the search manager 514 can assign additional buckets for search. As yet another example, if one search node 506 is 95% complete with its search while another search node 506 is less than 50% complete, the search manager 514 can dynamically assign additional buckets to the search node 506 that is 95% complete or re-assign buckets from the search node 506 that is less than 50% complete to the search node 506 that is 95% complete. In this way, the search manager 514 can improve the efficiency with which a computing system performs searches by increasing parallelization of searching and decreasing the search time.
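By way of non-limiting example, the re-assignment of buckets from a lagging search node to a nearly finished search node can be sketched as follows. The function name rebalance, the default thresholds (taken from the 95% and 50% figures of the example above), and the choice to move half of the lagging node's unsearched buckets are assumptions of this sketch; the description does not specify how many buckets are moved.

```python
def rebalance(remaining, progress, fast=0.95, slow=0.50):
    """Move unsearched buckets from slow nodes to nearly finished nodes.

    remaining: dict of node id -> list of bucket ids not yet searched (mutated in place)
    progress:  dict of node id -> fraction of assigned work completed (0.0 to 1.0)
    Returns a list of (from_node, to_node, moved_bucket_ids) tuples.
    """
    fast_nodes = [n for n, p in progress.items() if p >= fast]
    slow_nodes = [n for n, p in progress.items() if p < slow]
    moves = []
    # Pair each slow node with one fast node; unpaired nodes are left unchanged.
    for slow_node, fast_node in zip(slow_nodes, fast_nodes):
        queue = remaining.get(slow_node, [])
        take = len(queue) // 2  # assumption: hand over half of the backlog
        if take == 0:
            continue
        moved = queue[-take:]
        del queue[-take:]
        remaining.setdefault(fast_node, []).extend(moved)
        moves.append((slow_node, fast_node, moved))
    return moves
```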

At (13), the search nodes 506 send individual query results to the search head 504. As described herein, the search nodes 506 can send the query results as they are obtained from the buckets and/or send the results once they are completed by a search node 506. In some embodiments, as the search head 504 receives results from individual search nodes 506, it can track the progress of the query. For example, the search head 504 can track which buckets have been searched by the search nodes 506. Accordingly, in the event a search node 506 becomes unresponsive or fails, the search head 504 can assign a different search node 506 to complete the portion of the query assigned to the unresponsive search node 506. By tracking the buckets that have been searched by the search nodes 506 and instructing a different search node 506 to continue searching where the unresponsive search node 506 left off, the search head 504 can reduce the delay caused by a search node 506 becoming unresponsive, and can aid in providing a stateless searching service.

At (14), the search head 504 processes the results from the search nodes 506. As described herein, the search head 504 can perform one or more transformations on the data received from the search nodes 506. For example, some queries can include transformations that cannot be completed until the data is aggregated from the different search nodes 506. In some embodiments, the search head 504 can perform these transformations.

At (15), the search head 504 stores results in the query acceleration data store 222. As described herein, in some cases some, all, or a copy of the results of the query can be stored in the query acceleration data store 222. The results stored in the query acceleration data store 222 can be combined with other results already stored in the query acceleration data store 222 and/or be combined with subsequent results. For example, in some cases, the query system 214 can receive ongoing queries, or queries that do not have a predetermined end time. In such cases, as the search head 504 receives a first set of results, it can store the first set of results in the query acceleration data store 222. As subsequent results are received, the search head 504 can add them to the first set of results, and so forth. In this way, rather than executing the same or similar query across increasingly larger time ranges, the query system 214 can execute the query across a first time range and then aggregate the results of the query with the results of the query across a second time range. Thus, the query system 214 can reduce the number and size of queries being executed and can provide query results in a more time efficient manner.
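As a non-limiting illustration of combining stored results with subsequent results, the following sketch accumulates per-key counts for an ongoing query one time range at a time. The class name AccelerationStore, the method name update, the use of integer window end times, and the restriction to count-style results are assumptions of this sketch; it applies only to results that can be merged by addition.

```python
from collections import Counter


class AccelerationStore:
    """Hypothetical stand-in for a store that keeps running results per query."""

    def __init__(self):
        # query id -> (end of the latest time range already covered, running totals)
        self._results = {}

    def update(self, query_id, window_end, partial_counts):
        """Merge the counts for a newly searched time range into the stored totals."""
        covered_until, totals = self._results.get(query_id, (None, Counter()))
        if covered_until is not None and window_end <= covered_until:
            return totals  # this time range was already merged; do not double count
        totals = totals + Counter(partial_counts)
        self._results[query_id] = (window_end, totals)
        return totals
```

With such a store, each execution searches only the newest time range and adds its partial results to the stored totals, rather than re-searching the full, growing time range.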

At (16), the search head 504 terminates the search manager 514. As described herein, in some embodiments a search head 504 or a search master 512 can generate a search manager 514 for each query assigned to the search head 504. Accordingly, in some embodiments, upon completion of a search, the search head 504 or search master 512 can terminate the search manager 514. In certain embodiments, rather than terminating the search manager 514 upon completion of a query, the search head 504 can assign the search manager 514 to a new query.

As mentioned previously, in some embodiments, one or more of the functions described herein with respect to FIG. 13 can be omitted, performed in a variety of orders, and/or performed by a different component of the data intake and query system 108. For example, the search head 504 can monitor the status of the query throughout its execution by the search nodes 506 (e.g., during (10), (11), and (13)). Similarly, (1) and (2) can be performed concurrently, (3) and (4) can be performed concurrently, and all can be performed before, after, or concurrently with (5). Similarly, steps (6A) and (6B) and steps (7A) and (7B) can be performed before, after, or concurrently with each other. Further, (6A) and (7A) can be performed before, after, or concurrently with (6B) and (7B). As yet another example, (10), (11), and (13) can be performed concurrently. For example, a search node 506 can concurrently receive one or more files for one bucket, while searching the content of one or more files of a second bucket and sending query results for a third bucket to the search head 504. Similarly, the search head 504 can (8) map search nodes 506 to buckets while concurrently (9) generating instructions for and instructing other search nodes 506 to begin execution of the query.

4.3.1. Containerized Search Nodes

FIG. 14 is a flow diagram illustrative of an embodiment of a routine 1400 implemented by the query system 214 to execute a query. Although described as being implemented by the search manager 514, it will be understood that the elements outlined for routine 1400 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the query system manager 502, the search head 504, the search master 512, the search manager 514, the search nodes 506, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 1402, the search manager 514 receives a query. As described in greater detail above, the search manager 514 can receive the query from the search head 504, search master 512, etc. In some cases, the search manager 514 can receive the query from a client device 204. The query can be in a query language as described in greater detail above. In some cases, the query received by the search manager 514 can correspond to a query received and reviewed by the search head 504. For example, the search head 504 can determine whether the query was submitted by an authenticated user and/or review the query to determine that it is in a proper format for the data intake and query system 108, has correct semantics and syntax, etc. In some cases, the search head 504 can use a search master 512 to receive search queries, and in some cases, spawn the search manager 514 to process and execute the query.

At block 1404, the search manager 514 identifies one or more containerized search nodes, e.g., search nodes 506, to execute the query. As described herein, the query system 214 can include multiple containerized search nodes 506 to execute queries. One or more of the containerized search nodes 506 can be instantiated on the same computing device, and share the resources of the computing device. In addition, the containerized search nodes 506 can enable the query system 214 to provide a highly extensible and dynamic searching service. For example, based on resource availability and/or workload, the query system 214 can instantiate additional containerized search nodes 506 or terminate containerized search nodes 506. Furthermore, the query system 214 can dynamically assign containerized search nodes 506 to execute queries on data in common storage 216 based on a search node mapping policy.

As described herein, each search node 506 can be implemented using containerization or operating-system-level virtualization, or other virtualization technique. For example, the containerized search node 506, or one or more components of the search node 506, can be implemented as separate containers or container instances. Each container instance can have certain resources (e.g., memory, processor, etc.) of the underlying computing system assigned to it, but may share the same operating system and may use the operating system's system call interface. Further, each container may run the same or different computer applications concurrently or separately, and may interact with each other. It will be understood that other virtualization techniques can be used. For example, the containerized search nodes 506 can be implemented using virtual machines using full virtualization or paravirtualization, etc.

In some embodiments, the search node 506 can be implemented as a group of related containers or a pod, and the various components of the search node 506 can be implemented as related containers of a pod. Further, the search node 506 can assign different containers to execute different tasks. For example, one container of a containerized search node 506 can receive query instructions, a second container can obtain the data or buckets to be searched, and a third container of the containerized search node 506 can search the buckets and/or perform one or more transformations on the data. However, it will be understood that the containerized search node 506 can be implemented in a variety of configurations. For example, in some cases, the containerized search node 506 can be implemented as a single container and can include multiple processes to implement the tasks described above as being performed by the three containers. Any combination of containers and processes can be used to implement the containerized search node 506 as desired.

In some cases, the search manager 514 can identify the search nodes 506 using the search node catalog 510. For example, as described herein, a search node monitor 508 can monitor the status of the search nodes 506 instantiated in the query system 214. The search node monitor 508 can store the status of the search nodes 506 in the search node catalog 510.

In certain embodiments, the search manager 514 can identify search nodes 506 using a search node mapping policy, previous mappings, previous searches, or the contents of a data store associated with the search nodes 506. For example, based on the previous assignment of a search node 506 to search data as part of a query, the search manager 514 can assign the search node 506 to search the same data for a different query. As another example, as a search node 506 searches data, it can cache the data in a local or shared data store. Based on the data in the cache, the search manager 514 can assign the search node 506 to search the data again as part of a different query.

In certain embodiments, the search manager 514 can identify search nodes 506 based on shared resources. For example, if the search manager 514 determines that a search node 506 shares a data store with a search node 506 that previously performed a search on data and cached the data in the shared data store, the search manager 514 can assign the search node 506 that shares the data store to search the data stored therein as part of a different query.

In some embodiments, the search manager 514 can identify search nodes 506 using a hashing algorithm. For example, as described herein, the search manager 514 can perform a hash on a bucket identifier of a bucket that is to be searched to identify a search node 506 to search the bucket. In some implementations, that hash may be a consistent hash, to increase the chance that the same search node 506 will be selected to search that bucket as was previously used, thereby reducing the chance that the bucket must be retrieved from common storage 216.

It will be understood that the search manager 514 can identify search nodes 506 based on any one or any combination of the aforementioned methods. Furthermore, it will be understood that the search manager 514 can identify search nodes 506 in a variety of ways.

At block 1406, the search manager 514 instructs the search nodes 506 to execute the query. As described herein, the search manager 514 can process the query to determine portions of the query that it will execute and portions of the query to be executed by the search nodes 506. Furthermore, the search manager 514 can generate instructions or sub-queries for each search node 506 that is to execute a portion of the query. In some cases, the search manager 514 generates a DAG for execution by the search nodes 506. The instructions or sub-queries can identify the data or buckets to be searched by the search nodes 506. In addition, the instructions or sub-queries may identify one or more transformations that the search nodes 506 are to perform on the data.

Fewer, more, or different blocks can be used as part of the routine 1400. In some cases, one or more blocks can be omitted. For example, in certain embodiments, the search manager 514 can receive partial results from the search nodes 506, process the partial results, perform one or more transformations on the partial results or aggregated results, etc. Further, in some embodiments, the search manager 514 can provide the results to a client device 204. In some embodiments, the search manager 514 can combine the results with results stored in the query acceleration data store 222 or store the results in the query acceleration data store 222 for combination with additional search results.

In some cases, the search manager 514 can identify the data or buckets to be searched by, for example, using the data store catalog 220, and map the buckets to the search nodes 506 according to a search node mapping policy. As described herein, the data store catalog 220 can receive updates from the indexing system 212 about the data that is stored in common storage 216. The information in the data store catalog 220 can include, but is not limited to, information about the location of the buckets in common storage 216, and other information that can be used by the search manager 514 to identify buckets that include data that satisfies at least a portion of the query.

In certain cases, as part of executing the query, the search nodes 506 can obtain the data to be searched from common storage 216 using the cache manager 516. The obtained data can be stored on a local or shared data store and searched as part of the query. In addition, the data can be retained on the local or shared data store based on a bucket caching policy as described herein.

Furthermore, it will be understood that the various blocks described herein with reference to FIG. 14 can be implemented in a variety of orders. In some cases, the search manager 514 can implement some blocks concurrently or change the order as desired. For example, the search manager 514 can concurrently identify search nodes 506 to execute the query and instruct the search nodes 506 to execute the query. As described herein, in some embodiments, the search manager 514 can instruct the search nodes 506 to execute the query at once. In certain embodiments, the search manager 514 can assign a first group of buckets for searching, and dynamically assign additional groups of buckets to search nodes 506 depending on which search nodes 506 complete their searching first or based on an updated status of the search nodes 506, etc.

4.3.2. Identifying Buckets and Search Nodes for Query

FIG. 15 is a flow diagram illustrative of an embodiment of a routine 1500 implemented by the query system 214 to execute a query. Although described as being implemented by the search manager 514, it will be understood that the elements outlined for routine 1500 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the query system manager 502, the search head 504, the search master 512, the search manager 514, the search nodes 506, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 1502, the search manager 514 receives a query, as described in greater detail herein at least with reference to block 1402 of FIG. 14.

At block 1504, the search manager 514 identifies search nodes to execute the query, as described in greater detail herein at least with reference to block 1404 of FIG. 14. However, it will be noted that in certain embodiments, the search nodes 506 may not be containerized.

At block 1506, the search manager 514 identifies buckets to query. As described herein, in some cases, the search manager 514 can consult the data store catalog 220 to identify buckets to be searched. In certain embodiments, the search manager 514 can use metadata of the buckets stored in common storage 216 to identify the buckets for the query. For example, the search manager 514 can compare a tenant identifier and/or partition identifier associated with the query with the tenant identifier and/or partition identifier of the buckets. The search manager 514 can exclude buckets that have a tenant identifier and/or partition identifier that does not match the tenant identifier and/or partition identifier associated with the query. Similarly, the search manager 514 can compare a time range associated with the query with the time range associated with the buckets in common storage 216. Based on the comparison, the search manager 514 can identify buckets that satisfy the time range associated with the query (e.g., at least partly overlap with the time range from the query).

At block 1508, the search manager 514 executes the query. As described herein, at least with reference to block 1406 of FIG. 14, in some embodiments, as part of executing the query, the search manager 514 can process the search query, identify tasks for it to complete and tasks for the search nodes 506, generate instructions or sub-queries for the search nodes 506, and instruct the search nodes 506 to execute the query. Further, the search manager 514 can aggregate the results from the search nodes 506 and perform one or more transformations on the data.

Fewer, more, or different blocks can be used as part of the routine 1500. In some cases, one or more blocks can be omitted. For example, as described herein, the search manager 514 can map the search nodes 506 to certain data or buckets for the search according to a search node mapping policy. Based on the search node mapping policy, the search manager 514 can instruct the search nodes 506 to search the buckets to which they are mapped. Further, as described herein, in some cases, the search node mapping policy can indicate that the search manager 514 is to use a hashing algorithm, previous assignment, network architecture, cache information, etc., to map the search nodes 506 to the buckets.

As another example, the routine 1500 can include storing the search results in the query acceleration data store 222. Furthermore, as described herein, the search nodes 506 can store buckets from common storage 216 to a local or shared data store for searching, etc.

In addition, it will be understood that the various blocks described herein with reference to FIG. 15 can be implemented in a variety of orders, or implemented concurrently. For example, the search manager 514 can identify search nodes to execute the query and identify buckets for the query concurrently or in any order.

4.3.3. Identifying Buckets for Query Execution

FIG. 16 is a flow diagram illustrative of an embodiment of a routine 1600 implemented by the query system 214 to identify buckets for query execution. Although described as being implemented by the search manager 514, it will be understood that the elements outlined for routine 1600 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the query system manager 502, the search head 504, the search master 512, the search manager 514, the search nodes 506, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 1602, the data intake and query system 108 maintains a catalog of buckets in common storage 216. As described herein, the catalog can also be referred to as the data store catalog 220, and can include information about the buckets in common storage 216, such as, but not limited to, location information, metadata fields, tenant and partition information, time range information, etc. Further, the data store catalog 220 can be kept up-to-date based on information received from the indexing system 212 as the indexing system 212 processes and stores data in the common storage 216.

At block 1604, the search manager 514 receives a query, as described in greater detail herein at least with reference to block 1402 of FIG. 14.

At block 1606, the search manager 514 identifies buckets to be searched as part of the query using the data store catalog 220. As described herein, the search manager 514 can use the data store catalog 220 to filter the universe of buckets in the common storage 216 to buckets that include data that satisfies at least a portion of the query. For example, if a query includes a time range of 4/23/18 from 03:30:50 to 04:53:32, the search manager 514 can use the time range information in the data store catalog 220 to identify buckets with a time range that overlaps with the time range provided in the query. In addition, if the query indicates that only a _main partition is to be searched, the search manager 514 can use the information in the data store catalog 220 to identify buckets that satisfy the time range and are associated with the _main partition. Accordingly, depending on the information in the query and the information stored in the data store catalog 220 about the buckets, the search manager 514 can reduce the number of buckets to be searched. In this way, the data store catalog 220 can reduce search time and the processing resources used to execute a query.
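For illustration only, the catalog-based filtering in the example above can be sketched as follows, using the same time range and the _main partition. The names BucketEntry and select_buckets, the catalog being an in-memory list, and the bucket identifiers are hypothetical; treating the time range endpoints as inclusive is also an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BucketEntry:
    """Hypothetical catalog record for one bucket in common storage."""
    bucket_id: str
    partition: str
    start: datetime  # earliest event time in the bucket
    end: datetime    # latest event time in the bucket


def select_buckets(catalog, partition, query_start, query_end):
    """Return ids of buckets in the partition whose time range overlaps the query's."""
    return [
        b.bucket_id
        for b in catalog
        if b.partition == partition
        and b.start <= query_end   # bucket begins before the query range ends
        and b.end >= query_start   # bucket ends after the query range begins
    ]


catalog = [
    BucketEntry("b1", "_main", datetime(2018, 4, 23, 3, 0, 0), datetime(2018, 4, 23, 3, 45, 0)),
    BucketEntry("b2", "_main", datetime(2018, 4, 23, 5, 0, 0), datetime(2018, 4, 23, 6, 0, 0)),
    BucketEntry("b3", "_internal", datetime(2018, 4, 23, 3, 0, 0), datetime(2018, 4, 23, 5, 0, 0)),
    BucketEntry("b4", "_main", datetime(2018, 4, 23, 4, 0, 0), datetime(2018, 4, 23, 4, 30, 0)),
]
selected = select_buckets(
    catalog, "_main",
    datetime(2018, 4, 23, 3, 30, 50), datetime(2018, 4, 23, 4, 53, 32),
)
```

In this sketch, only buckets b1 and b4 are selected: b2 falls outside the time range and b3 belongs to a different partition, so neither needs to be retrieved or searched.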

At block 1608, the search manager 514 executes the query, as described in greater detail herein at least with reference to block 1508 of FIG. 15.

Fewer, more, or different blocks can be used as part of the routine 1600. In some cases, one or more blocks can be omitted. For example, as described herein, the search manager 514 can identify and map search nodes 506 to the buckets for searching or store the search results in the accelerated data store 222. Furthermore, as described herein, the search nodes 506 can store buckets from common storage 216 to a local or shared data store for searching, etc. In addition, it will be understood that the various blocks described herein with reference to FIG. 16 can be implemented in a variety of orders, or implemented concurrently.

4.3.4. Identifying Search Nodes for Query Execution

FIG. 17 is a flow diagram illustrative of an embodiment of a routine1700 implemented by the query system 214 to identify search nodes forquery execution. Although described as being implemented by the searchmanager 514, it will be understood that the elements outlined forroutine 1700 can be implemented by one or more computingdevices/components that are associated with the data intake and querysystem 108, such as, but not limited to, the query system manager 502,the search head 504, the search master 512, the search manager 514, thesearch nodes 506, etc. Thus, the following illustrative embodimentshould not be construed as limiting.

At block 1702, the query system 214 maintains a catalog of instantiatedsearch nodes 506. As described herein, the catalog can also be referredto as the search node catalog 510, and can include information about thesearch nodes 506, such as, but not limited to, availability,utilization, responsiveness, network architecture, etc. Further, thesearch node catalog 510 can be kept up-to-date based on informationreceived by the search node monitor 508 from the search nodes 506.

At block 1704, the search manager 514 receives a query, as described in greater detail herein at least with reference to block 1402 of FIG. 14.

At block 1706, the search manager 514 identifies available search nodes using the search node catalog 510.
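By way of a hedged example, identifying available search nodes from a catalog could resemble the following sketch. The catalog layout, the field names "available" and "utilization", the function name, and the 0.8 utilization threshold are all assumptions made for illustration only.

```python
def identify_available_nodes(search_node_catalog, max_utilization=0.8):
    """Return identifiers of nodes that are available and not overloaded."""
    return sorted(
        node_id
        for node_id, info in search_node_catalog.items()
        if info["available"] and info["utilization"] <= max_utilization
    )


# Hypothetical catalog contents, e.g., as reported by a search node monitor.
search_node_catalog = {
    "node-1": {"available": True, "utilization": 0.35},
    "node-2": {"available": False, "utilization": 0.10},
    "node-3": {"available": True, "utilization": 0.95},
    "node-4": {"available": True, "utilization": 0.50},
}

available = identify_available_nodes(search_node_catalog)
```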

At block 1708, the search manager 514 instructs the search nodes 506 to execute the query, as described in greater detail herein at least with reference to block 1406 of FIG. 14 and block 1508 of FIG. 15.

Fewer, more, or different blocks can be used as part of the routine1700. In some cases, one or more blocks can be omitted. For example, incertain embodiments, the search manager can identify buckets in commonstorage 216 for searching. In addition, it will be understood that thevarious blocks described herein with reference to FIG. 17 can beimplemented in a variety of orders, or implemented concurrently.

4.3.5. Hashing Bucket Identifiers for Query Execution

FIG. 18 is a flow diagram illustrative of an embodiment of a routine1800 implemented by the query system 214 to hash bucket identifiers forquery execution. Although described as being implemented by the searchmanager 514, it will be understood that the elements outlined forroutine 1800 can be implemented by one or more computingdevices/components that are associated with the data intake and querysystem 108, such as, but not limited to, the query system manager 502,the search head 504, the search master 512, the search manager 514, thesearch nodes 506, etc. Thus, the following illustrative embodimentshould not be construed as limiting.

At block 1802, the search manager 514 receives a query, as described in greater detail herein at least with reference to block 1402 of FIG. 14.

At block 1804, the search manager 514 identifies bucket identifiersassociated with buckets to be searched as part of the query. The bucketidentifiers can correspond to an alphanumeric identifier or otheridentifier that can be used to uniquely identify the bucket from otherbuckets stored in common storage 216. In some embodiments, the uniqueidentifier may incorporate one or more portions of a tenant identifier,partition identifier, or time range of the bucket or a random orsequential (e.g., based on time of storage, creation, etc.) alphanumericstring, etc. As described herein, the search manager 514 can parse thequery to identify buckets to be searched. In some cases, the searchmanager 514 can identify buckets to be searched and an associated bucketidentifier based on metadata of the buckets and/or using a data storecatalog 220. However, it will be understood that the search manager 514can use a variety of techniques to identify buckets to be searched.

At block 1806, the search manager 514 performs a hash function on the bucket identifiers. The search manager can, in some embodiments, use the output of the hash function to identify a search node 506 to search the bucket. As a non-limiting example, consider a scenario in which a bucket identifier is 4149 and the search manager 514 identified ten search nodes to process the query. The search manager 514 could perform a modulo ten operation on the bucket identifier to determine which search node 506 is to search the bucket. Based on this example, the search manager 514 would assign the bucket having the identifier 4149 to the ninth search node 506, because the value 4149 modulo ten is 9. In some cases, the search manager can use a consistent hash to increase the likelihood that the same search node 506 is repeatedly assigned to the same bucket for searching. In this way, the search manager 514 can increase the likelihood that the bucket to be searched is already located in a local or shared data store of the search node 506, and reduce the likelihood that the bucket will be downloaded from common storage 216. It will be understood that the search manager can use a variety of techniques to map the bucket to a search node 506 according to a search node mapping policy. For example, the search manager 514 can use previous assignments, network architecture, etc., to assign buckets to search nodes 506 according to the search node mapping policy.
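The two mapping approaches described above can be sketched as follows. The first function reproduces the modulo example. The second uses rendezvous (highest-random-weight) hashing, which is one technique with consistent-hash behavior; the description does not specify which consistent hash is used, so this choice, the use of SHA-256, and all function and node names are assumptions.

```python
import hashlib


def modulo_assign(bucket_id: int, node_count: int) -> int:
    """Map a numeric bucket identifier to a node number by modulo."""
    return bucket_id % node_count


def _score(bucket_id: str, node_id: str) -> int:
    """Deterministic pseudo-random weight for a (node, bucket) pair."""
    digest = hashlib.sha256(f"{node_id}:{bucket_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_assign(bucket_id: str, node_ids):
    """Pick the node with the highest weight for this bucket.

    Removing any node other than the chosen one leaves the choice
    unchanged, so a bucket tends to stay with the same search node.
    """
    return max(node_ids, key=lambda node_id: _score(bucket_id, node_id))


# Modulo example from the text: identifier 4149 and ten search nodes.
node_number = modulo_assign(4149, 10)

# Consistent-style assignment over ten hypothetical node names.
nodes = [f"node-{i}" for i in range(10)]
chosen = rendezvous_assign("4149", nodes)

# Drop one node that was not chosen; the assignment is stable.
dropped = next(n for n in nodes if n != chosen)
remaining = [n for n in nodes if n != dropped]
chosen_after_change = rendezvous_assign("4149", remaining)
```

With plain modulo mapping, changing the number of search nodes can remap most buckets. The rendezvous variant is shown because it keeps existing assignments when unrelated nodes are added or removed, which matches the stated goal of finding the bucket already present in a local or shared data store.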

At block 1808, the search manager 514 instructs the search nodes 506 to execute the query, as described in greater detail herein at least with reference to block 1406 of FIG. 14 and block 1508 of FIG. 15.

Fewer, more, or different blocks can be used as part of the routine 1800. In some cases, one or more blocks can be omitted. In addition, it will be understood that the various blocks described herein with reference to FIG. 18 can be implemented in a variety of orders, or implemented concurrently.

4.3.6. Obtaining Data for Query Execution

FIG. 19 is a flow diagram illustrative of an embodiment of a routine1900 implemented by a search node 506 to execute a search on a bucket.Although reference is made to downloading and searching a bucket, itwill be understood that this can refer to downloading and searching oneor more files associated within a bucket and does not necessarily referto downloading all files associated with the bucket.

Further, although described as being implemented by the search node 506,it will be understood that the elements outlined for routine 1900 can beimplemented by one or more computing devices/components that areassociated with the data intake and query system 108, such as, but notlimited to, the query system manager 502, the search head 504, thesearch master 512, search manager 514, cache manager 516, etc. Thus, thefollowing illustrative embodiment should not be construed as limiting.

At block 1902, the search node 506 receives instructions for a query or sub-query. As described herein, a search manager 514 can receive and parse a query to determine the tasks to be assigned to the search nodes 506, such as, but not limited to, the searching of one or more buckets in common storage 216, etc. The search node 506 can parse the instructions and identify the buckets that are to be searched. In some cases, the search node 506 can determine that a bucket that is to be searched is not located in the search node's local or shared data store.

At block 1904, the search node 506 obtains the bucket from commonstorage 216. As described herein, in some embodiments, the search node506 obtains the bucket from common storage 216 in conjunction with acache manager 516. For example, the search node 506 can request thecache manager 516 to identify the location of the bucket. The cachemanager 516 can review the data stored in the local or shared data storefor the bucket. If the cache manager 516 cannot locate the bucket in thelocal or shared data store, it can inform the search node 506 that thebucket is not stored locally and that it will be retrieved from commonstorage 216. As described herein, in some cases, the cache manager 516can download a portion of the bucket (e.g., one or more files) andprovide the portion of the bucket to the search node 506 as part ofinforming the search node 506 that the bucket is not found locally. Thesearch node 506 can use the downloaded portion of the bucket to identifyany other portions of the bucket that are to be retrieved from commonstorage 216.

Accordingly, as described herein, the search node 506 can retrieve all or portions of the bucket from common storage 216 and store the retrieved portions to a local or shared data store.
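A minimal sketch of this lookup-then-retrieve behavior is shown below. Dictionaries stand in for the local or shared data store and for common storage 216; the function name and the returned source labels are hypothetical, and a real cache manager 516 could retrieve individual files rather than whole buckets.

```python
def obtain_bucket(bucket_id, local_store, common_storage):
    """Return (bucket, source), downloading from common storage on a miss."""
    if bucket_id in local_store:
        # The bucket is already cached in the local or shared data store.
        return local_store[bucket_id], "local"
    # Not found locally: retrieve from common storage and keep a copy.
    bucket = common_storage[bucket_id]
    local_store[bucket_id] = bucket
    return bucket, "common_storage"


common_storage = {"bucket-4149": ["event a", "event b"]}
local_store = {}

first = obtain_bucket("bucket-4149", local_store, common_storage)
second = obtain_bucket("bucket-4149", local_store, common_storage)
```

In this sketch, the first request is served from common storage and the second from the local copy, which is the situation that repeated assignment of a bucket to the same search node is intended to make more likely.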

At block 1906, the search node 506 executes the search on the portions of the bucket stored in the local data store. As described herein, the search node 506 can review one or more files of the bucket to identify data that satisfies the query. In some cases, the search node 506 searches an inverted index to identify the data. In certain embodiments, the search node 506 searches the raw machine data, or uses one or more configuration files, regex rules, and/or late binding schema to identify data in the bucket that satisfies the query.

Fewer, more, or different blocks can be used as part of the routine1900. For example, in certain embodiments, the routine 1900 includesblocks for requesting a cache manager 516 to search for the bucket inthe local or shared storage, and a block for informing the search node506 that the requested bucket is not available in the local or shareddata store. As another example, the routine 1900 can include performingone or more transformations on the data, and providing partial searchresults to a search manager 514, etc. In addition, it will be understoodthat the various blocks described herein with reference to FIG. 19 canbe implemented in a variety of orders, or implemented concurrently.

4.3.7. Caching Search Results

FIG. 20 is a flow diagram illustrative of an embodiment of a routine 2000 implemented by the query system 214 to store search results. Although described as being implemented by the search manager 514, it will be understood that the elements outlined for routine 2000 can be implemented by one or more computing devices/components that are associated with the data intake and query system 108, such as, but not limited to, the query system manager 502, the search head 504, the search master 512, the search nodes 506, etc. Thus, the following illustrative embodiment should not be construed as limiting.

At block 2002, the search manager 514 receives a query, as described in greater detail herein at least with reference to block 1402 of FIG. 14, and at block 2004, the search manager 514 executes the query, as described in greater detail herein at least with reference to block 1508 of FIG. 15. For example, as described herein, the search manager 514 can identify buckets for searching, assign the buckets to search nodes 506, and instruct the search nodes 506 to search the buckets. Furthermore, the search manager can receive partial results from each of the buckets, and perform one or more transformations on the received data.

At block 2006, the search manager 514 stores the results in theaccelerated data store 222. As described herein, the results can becombined with results previously stored in the accelerated data store222 and/or can be stored for combination with results to be obtainedlater in time. In some cases, the search manager 514 can receive queriesand determine that at least a portion of the results are stored in theaccelerated data store 222. Based on the identification, the searchmanager 514 can generate instructions for the search nodes 506 to obtainresults to the query that are not stored in the accelerated data store222, combine the results in the accelerated data store 222 with resultsobtained by the search nodes 506, and provide the aggregated searchresults to the client device 204, or store the aggregated search resultsin the accelerated data store 222 for further aggregation. By storingresults in the accelerated data store 222, the search manager 514 canreduce the search time and computing resources used for future searchesthat rely on the query results.
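The reuse of stored results can be illustrated with the following sketch, which assumes that results are stored per time window and that a query's results are the combination of its windows. That keying scheme, the window labels, the per-window counts, and the function names are assumptions made for illustration and are not details of the accelerated data store 222.

```python
def execute_with_accelerated_store(time_windows, accelerated_store, search_window):
    """Return per-window results, searching only windows not yet stored."""
    combined = {}
    for window in time_windows:
        if window not in accelerated_store:
            # Only the missing portion is delegated to the search nodes,
            # and the new result is stored for future queries.
            accelerated_store[window] = search_window(window)
        combined[window] = accelerated_store[window]
    return combined


# Hypothetical per-window event counts that a search would produce.
event_counts = {"03:00-04:00": 7, "04:00-05:00": 12}
searched = []  # records which windows actually required a search


def search_window(window):
    searched.append(window)
    return event_counts[window]


accelerated_store = {"03:00-04:00": 7}  # result kept from an earlier query
results = execute_with_accelerated_store(
    ["03:00-04:00", "04:00-05:00"], accelerated_store, search_window
)
total = sum(results.values())
```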

Fewer, more, or different blocks can be used as part of the routine2000. In some cases, one or more blocks can be omitted. For example, incertain embodiments, the search manager 514 can consult a data storecatalog 220 to identify buckets, consult a search node catalog 510 toidentify available search nodes, map buckets to search nodes 506, etc.Further, in some cases, the search nodes 506 can retrieve buckets fromcommon storage 216. In addition, it will be understood that the variousblocks described herein with reference to FIG. 20 can be implemented ina variety of orders, or implemented concurrently.

4.4. Data Ingestion, Indexing, and Storage Flow

FIG. 21A is a flow diagram of an example method that illustrates how adata intake and query system 108 processes, indexes, and stores datareceived from data sources 202, in accordance with example embodiments.The data flow illustrated in FIG. 21A is provided for illustrativepurposes only; it will be understood that one or more of the steps ofthe processes illustrated in FIG. 21A may be removed or that theordering of the steps may be changed. Furthermore, for the purposes ofillustrating a clear example, one or more particular system componentsare described in the context of performing various operations duringeach of the data flow stages. For example, the intake system 210 isdescribed as receiving and processing machine data during an inputphase; the indexing system 212 is described as parsing and indexingmachine data during parsing and indexing phases; and a query system 214is described as performing a search query during a search phase.However, other system arrangements and distributions of the processingsteps across system components may be used.

4.4.1. Input

At block 2102, the intake system 210 receives data from an input source,such as a data source 202 shown in FIG. 2 . The intake system 210initially may receive the data as a raw data stream generated by theinput source. For example, the intake system 210 may receive a datastream from a log file generated by an application server, from a streamof network data from a network device, or from any other source of data.In some embodiments, the intake system 210 receives the raw data and maysegment the data stream into messages, possibly of a uniform data size,to facilitate subsequent processing steps. The intake system 210 maythereafter process the messages in accordance with one or more rules, asdiscussed above for example with reference to FIGS. 6 and 7 , to conductpreliminary processing of the data. In one embodiment, the processingconducted by the intake system 210 may be used to indicate one or moremetadata fields applicable to each message. For example, the intakesystem 210 may include metadata fields within the messages, or publishthe messages to topics indicative of a metadata field. These metadatafields may, for example, provide information related to a message as awhole and may apply to each event that is subsequently derived from thedata in the message. For example, the metadata fields may includeseparate fields specifying each of a host, a source, and a source typerelated to the message. A host field may contain a value identifying ahost name or IP address of a device that generated the data. A sourcefield may contain a value identifying a source of the data, such as apathname of a file or a protocol and port related to received networkdata. A source type field may contain a value specifying a particularsource type label for the data. Additional metadata fields may also beincluded during the input phase, such as a character encoding of thedata, if known, and possibly other values that provide informationrelevant to later processing steps.

At block 2104, the intake system 210 publishes the data as messages on an output ingestion buffer 310. Illustratively, other components of the data intake and query system 108 may be configured to subscribe to various topics on the output ingestion buffer 310, thus receiving the data of the messages when published to the buffer 310.

4.4.2. Parsing

At block 2106, the indexing system 212 receives messages from the intakesystem 210 (e.g., by obtaining the messages from the output ingestionbuffer 310) and parses the data of the message to organize the data intoevents. In some embodiments, to organize the data into events, theindexing system 212 may determine a source type associated with eachmessage (e.g., by extracting a source type label from the metadatafields associated with the message, etc.) and refer to a source typeconfiguration corresponding to the identified source type. The sourcetype definition may include one or more properties that indicate to theindexing system 212 to automatically determine the boundaries within thereceived data that indicate the portions of machine data for events. Ingeneral, these properties may include regular expression-based rules ordelimiter rules where, for example, event boundaries may be indicated bypredefined characters or character strings. These predefined charactersmay include punctuation marks or other special characters including, forexample, carriage returns, tabs, spaces, line breaks, etc. If a sourcetype for the data is unknown to the indexing system 212, the indexingsystem 212 may infer a source type for the data by examining thestructure of the data. Then, the indexing system 212 can apply aninferred source type definition to the data to create the events.
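As one hedged illustration of a regular expression-based boundary rule, the sketch below splits message data into events wherever a line begins with a timestamp. The source type name "app_log", the layout of the definition, and the sample data are hypothetical.

```python
import re

# Hypothetical source type definition: a new event begins at each line
# that starts with a "YYYY-MM-DD HH:MM:SS" time value.
SOURCE_TYPE_DEFINITIONS = {
    "app_log": {"event_boundary": r"\n(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"},
}


def break_into_events(message_data, source_type):
    """Split message data into events using the source type's boundary rule."""
    pattern = SOURCE_TYPE_DEFINITIONS[source_type]["event_boundary"]
    return [event for event in re.split(pattern, message_data) if event.strip()]


message_data = (
    "2019-10-18 10:00:00 ERROR failed\n"
    "  at line 3\n"
    "2019-10-18 10:00:05 INFO ok"
)
events = break_into_events(message_data, "app_log")
```

Because the rule splits only at line breaks that are followed by a time value, the indented continuation line stays with the first event instead of becoming an event of its own.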

At block 2108, the indexing system 212 determines a timestamp for eachevent. Similar to the process for parsing machine data, an indexingsystem 212 may again refer to a source type definition associated withthe data to locate one or more properties that indicate instructions fordetermining a timestamp for each event. The properties may, for example,instruct the indexing system 212 to extract a time value from a portionof data for the event, to interpolate time values based on timestampsassociated with temporally proximate events, to create a timestamp basedon a time the portion of machine data was received or generated, to usethe timestamp of a previous event, or use any other rules fordetermining timestamps.
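Three of the timestamp rules listed above (extracting a time value, reusing the timestamp of a previous event, and falling back to the time of receipt) can be sketched as follows. The time format, the order in which the rules are tried, and the function name are assumptions; an actual source type definition could specify different rules.

```python
import re
from datetime import datetime

TIME_VALUE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def assign_timestamps(events, received_at):
    """Pair each event with a timestamp chosen by simple fallback rules."""
    stamped = []
    previous = None
    for text in events:
        match = TIME_VALUE.search(text)
        if match:
            # Rule 1: extract a time value from the event data.
            timestamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
        elif previous is not None:
            # Rule 2: use the timestamp of the previous event.
            timestamp = previous
        else:
            # Rule 3: use the time the data was received.
            timestamp = received_at
        stamped.append((timestamp, text))
        previous = timestamp
    return stamped


received_at = datetime(2019, 10, 18, 10, 1, 0)
stamped = assign_timestamps(
    [
        "2019-10-18 10:00:00 ERROR failed",
        "continuation without a time value",
        "2019-10-18 10:00:05 INFO ok",
    ],
    received_at,
)
```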

At block 2110, the indexing system 212 associates with each event one or more metadata fields including a field containing the timestamp determined for the event. In some embodiments, a timestamp may be included in the metadata fields. These metadata fields may include any number of “default fields” that are associated with all events, and may also include one or more custom fields as defined by a user. Similar to the metadata fields associated with the data blocks at block 2104, the default metadata fields associated with each event may include a host, source, and source type field including or in addition to a field storing the timestamp.

At block 2112, the indexing system 212 may optionally apply one or more transformations to data included in the events created at block 2106. For example, such transformations can include removing a portion of an event (e.g., a portion used to define event boundaries, extraneous characters from the event, other extraneous text, etc.), masking a portion of an event (e.g., masking a credit card number), removing redundant portions of an event, etc. The transformations applied to events may, for example, be specified in one or more configuration files and referenced by one or more source type definitions.
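A masking transformation of the kind mentioned above could, for example, be expressed as a regular expression substitution. The pattern below assumes 16-digit numbers written in four groups of four digits with optional space or hyphen separators, and keeps the last four digits; both choices, and the function name, are illustrative assumptions.

```python
import re

# Assumed card number shape: four groups of four digits, optionally
# separated by a space or a hyphen.
CARD_NUMBER = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def mask_card_numbers(event_text):
    """Replace all but the last four digits of each matched card number."""
    return CARD_NUMBER.sub(
        lambda match: "XXXX-XXXX-XXXX-" + match.group(0)[-4:], event_text
    )


masked = mask_card_numbers("purchase card=4111-1111-1111-1111 status=ok")
```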

FIG. 21C illustrates an example of how machine data can be stored in a data store in accordance with various disclosed embodiments. In other embodiments, machine data can be stored in a flat file in a corresponding bucket with an associated index file, such as a time series index or “TSIDX.” As such, the depiction of machine data and associated metadata as rows and columns in the table of FIG. 21C is merely illustrative and is not intended to limit the data format in which the machine data and metadata is stored in various embodiments described herein. In one particular embodiment, machine data can be stored in a compressed or encrypted format. In such embodiments, the machine data can be stored with or be associated with data that describes the compression or encryption scheme with which the machine data is stored. The information about the compression or encryption scheme can be used to decompress or decrypt the machine data, and any metadata with which it is stored, at search time.

As mentioned above, certain metadata, e.g., host 2136, source 2137,source type 2138 and timestamps 2135 can be generated for each event,and associated with a corresponding portion of machine data 2139 whenstoring the event data in a data store, e.g., data store 212. Any of themetadata can be extracted from the corresponding machine data, orsupplied or defined by an entity, such as a user or computer system. Themetadata fields can become part of or stored with the event. Note thatwhile the time-stamp metadata field can be extracted from the raw dataof each event, the values for the other metadata fields may bedetermined by the indexing system 212 or indexing node 404 based oninformation it receives pertaining to the source of the data separatefrom the machine data.

While certain default or user-defined metadata fields can be extracted from the machine data for indexing purposes, all the machine data within an event can be maintained in its original condition. As such, in embodiments in which the portion of machine data included in an event is unprocessed or otherwise unaltered, it is referred to herein as a portion of raw machine data. In other embodiments, the portion of machine data in an event can be processed or otherwise altered. As such, unless certain information needs to be removed for some reason (e.g., extraneous information, confidential information), all the raw machine data contained in an event can be preserved and saved in its original form. Accordingly, the data store in which the event records are stored is sometimes referred to as a “raw record data store.” The raw record data store contains a record of the raw event data tagged with the various default fields.

In FIG. 21C, the first three rows of the table represent events 2131, 2132, and 2133 and are related to a server access log that records requests from multiple clients processed by a server, as indicated by entry of “access.log” in the source column 2137.

In the example shown in FIG. 21C, each of the events 2131-2133 is associated with a discrete request made from a client device. The raw machine data generated by the server and extracted from a server access log can include the IP address of the client 2140, the user id of the person requesting the document 2141, the time the server finished processing the request 2142, the request line from the client 2143, the status code returned by the server to the client 2145, the size of the object returned to the client (in this case, the gif file requested by the client) 2146, and the time spent to serve the request in microseconds 2144. As seen in FIG. 21C, all the raw machine data retrieved from the server access log is retained and stored as part of the corresponding events 2131-2133 in the data store.

Event 2134 is associated with an entry in a server error log, as indicated by “error.log” in the source column 2137, that records errors that the server encountered when processing a client request. Similar to the events related to the server access log, all the raw machine data in the error log file pertaining to event 2134 can be preserved and stored as part of the event 2134.

Saving minimally processed or unprocessed machine data in a data store associated with metadata fields in a manner similar to that shown in FIG. 21C is advantageous because it allows search of all the machine data at search time instead of searching only previously specified and identified fields or field-value pairs. As mentioned above, because data structures used by various embodiments of the present disclosure maintain the underlying raw machine data and use a late-binding schema for searching the raw machine data, they enable a user to continue investigating and to gain valuable insights about the raw data. In other words, the user is not compelled to know about all the fields of information that will be needed at data ingestion time. As a user learns more about the data in the events, the user can continue to refine the late-binding schema by defining new extraction rules, or modifying or deleting existing extraction rules used by the system.

4.4.3. Indexing

At blocks 2114 and 2116, the indexing system 212 can optionally generatea keyword index to facilitate fast keyword searching for events. Tobuild a keyword index, at block 2114, the indexing system 212 identifiesa set of keywords in each event. At block 2116, the indexing system 212includes the identified keywords in an index, which associates eachstored keyword with reference pointers to events containing that keyword(or to locations within events where that keyword is located, otherlocation identifiers, etc.). When the data intake and query system 108subsequently receives a keyword-based query, the query system 214 canaccess the keyword index to quickly identify events containing thekeyword.

In some embodiments, the keyword index may include entries for fieldname-value pairs found in events, where a field name-value pair caninclude a pair of keywords connected by a symbol, such as an equals signor colon. This way, events containing these field name-value pairs canbe quickly located. In some embodiments, fields can automatically begenerated for some or all of the field names of the field name-valuepairs at the time of indexing. For example, if the string“dest=10.0.1.2” is found in an event, a field named “dest” may becreated for the event, and assigned a value of “10.0.1.2”.
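The keyword index of blocks 2114 and 2116, including entries for field name-value pairs, can be sketched as a mapping from each keyword to the references of the events that contain it. The tokenization rules, the use of integer event references, the function name, and the sample events are assumptions made for illustration.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9_.]+")
# A field name-value pair: two keywords joined by an equals sign or colon.
PAIR = re.compile(r"(\w+)[=:]([\w.]+)")


def build_keyword_index(events):
    """Map each keyword and field name-value pair to event references."""
    index = defaultdict(list)
    for reference, text in events.items():
        entries = set(TOKEN.findall(text))
        entries |= {f"{name}={value}" for name, value in PAIR.findall(text)}
        for entry in sorted(entries):
            index[entry].append(reference)
    return index


# Hypothetical events keyed by event reference.
events = {
    1: "error dest=10.0.1.2",
    2: "ok dest=10.0.1.3",
    3: "error timeout",
}
index = build_keyword_index(events)

# A keyword-based query becomes a lookup instead of a scan of every event.
error_events = index["error"]
dest_events = index["dest=10.0.1.2"]
```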

At block 2118, the indexing system 212 stores the events with an associated timestamp in a local data store 412 and/or common storage 216. Timestamps enable a user to search for events based on a time range. In some embodiments, the stored events are organized into “buckets,” where each bucket stores events associated with a specific time range based on the timestamps associated with each event. This improves time-based searching, and allows events with recent timestamps, which may have a higher likelihood of being accessed, to be stored in a faster memory to facilitate faster retrieval. For example, buckets containing the most recent events can be stored in flash memory rather than on a hard disk. In some embodiments, each bucket may be associated with an identifier, a time range, and a size constraint.
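A simplified sketch of organizing events into time-range buckets follows. It assumes fixed-width time ranges keyed by epoch seconds and ignores size constraints; the description does not require fixed-width ranges, so the span, the key format, and the function name are illustrative assumptions.

```python
from collections import defaultdict


def bucket_events(events, span_seconds=3600):
    """Group (epoch_seconds, text) events into fixed-width time ranges."""
    buckets = defaultdict(list)
    for timestamp, text in events:
        start = timestamp - (timestamp % span_seconds)
        # Each bucket is keyed by its half-open range [start, start + span).
        buckets[(start, start + span_seconds)].append(text)
    return dict(buckets)


buckets = bucket_events([(0, "a"), (3599, "b"), (3600, "c")])
```

A time-bounded query then only needs to open the buckets whose range overlaps the query's time range.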

The indexing system 212 may be responsible for storing the eventscontained in various data stores 218 of common storage 216. Bydistributing events among the data stores in common storage 216, thequery system 214 can analyze events for a query in parallel. Forexample, using map-reduce techniques, each search node 506 can returnpartial responses for a subset of events to a search head that combinesthe results to produce an answer for the query. By storing events inbuckets for specific time ranges, the indexing system 212 may furtheroptimize the data retrieval process by enabling search nodes 506 tosearch buckets corresponding to time ranges that are relevant to aquery.

In some embodiments, each indexing node 404 (e.g., the indexer 410 or data store 412) of the indexing system 212 has a home directory and a cold directory. The home directory stores hot buckets and warm buckets, and the cold directory stores cold buckets. A hot bucket is a bucket that is capable of receiving and storing events. A warm bucket is a bucket that can no longer receive events for storage but has not yet been moved to the cold directory. A cold bucket is a bucket that can no longer receive events and may be a bucket that was previously stored in the home directory. The home directory may be stored in faster memory, such as flash memory, as events may be actively written to the home directory, and the home directory may typically store events that are more frequently searched and thus are accessed more frequently. The cold directory may be stored in slower and/or larger memory, such as a hard disk, as events are no longer being written to the cold directory, and the cold directory may typically store events that are not as frequently searched and thus are accessed less frequently. In some embodiments, an indexing node 404 may also have a quarantine bucket that contains events having potentially inaccurate information, such as an incorrect timestamp associated with the event or a timestamp that appears to be unreasonable for the corresponding event. The quarantine bucket may have events from any time range; as such, the quarantine bucket may always be searched at search time. Additionally, an indexing node 404 may store old, archived data in a frozen bucket that is not capable of being searched at search time. In some embodiments, a frozen bucket may be stored in slower and/or larger memory, such as a hard disk, and may be stored in offline and/or remote storage.

In some embodiments, an indexing node 404 may not include a colddirectory and/or cold or frozen buckets. For example, as warm bucketsand/or merged buckets are copied to common storage 216, they can bedeleted from the indexing node 404. In certain embodiments, one or moredata stores 218 of the common storage 216 can include a home directorythat includes warm buckets copied from the indexing nodes 404 and a colddirectory of cold or frozen buckets as described above.

Moreover, events and buckets can also be replicated across different indexing nodes 404 and data stores 218 of the common storage 216.

FIG. 21B is a block diagram of an example data store 2101 that includesa directory for each index (or partition) that contains a portion ofdata stored in the data store 2101. FIG. 21B further illustrates detailsof an embodiment of an inverted index 2107B and an event reference array2115 associated with inverted index 2107B.

The data store 2101 can correspond to a data store 218 that stores events in common storage 216, a data store 412 associated with an indexing node 404, or a data store associated with a search node 506. In the illustrated embodiment, the data store 2101 includes a _main directory 2103 associated with a _main partition and a _test directory 2105 associated with a _test partition. However, the data store 2101 can include fewer or more directories. In some embodiments, multiple indexes can share a single directory or all indexes can share a common directory. Additionally, although illustrated as a single data store 2101, it will be understood that the data store 2101 can be implemented as multiple data stores storing different portions of the information shown in FIG. 21B. For example, a single index or partition can span multiple directories or multiple data stores, and can be indexed or searched by multiple search nodes 506.

Furthermore, although not illustrated in FIG. 21B, it will be understood that, in some embodiments, the data store 2101 can include directories for each tenant and sub-directories for each partition of each tenant, or vice versa. Accordingly, the directories 2103 and 2105 illustrated in FIG. 21B can, in certain embodiments, correspond to sub-directories of a tenant or include sub-directories for different tenants.

In the illustrated embodiment of FIG. 21B, the partition-specific directories 2103 and 2105 include inverted indexes 2107A, 2107B and 2109A, 2109B, respectively. The inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B can be keyword indexes or field-value pair indexes described herein and can include less or more information than depicted in FIG. 21B.

In some embodiments, each inverted index 2107A . . . 2107B, and 2109A . . . 2109B can correspond to a distinct time-series bucket that is stored in common storage 216, a search node 506, or an indexing node 404 and that contains events corresponding to the relevant partition (e.g., _main partition, _test partition). As such, each inverted index can correspond to a particular range of time for a partition. Additional files, such as high performance indexes for each time-series bucket of a partition, can also be stored in the same directory as the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B. In some embodiments, an inverted index 2107A . . . 2107B, and 2109A . . . 2109B can correspond to multiple time-series buckets, or the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B can correspond to a single time-series bucket.

Each inverted index 2107A . . . 2107B, and 2109A . . . 2109B can include one or more entries, such as keyword (or token) entries or field-value pair entries. Furthermore, in certain embodiments, the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B can include additional information, such as a time range 2123 associated with the inverted index or a partition identifier 2125 identifying the partition associated with the inverted index 2107A . . . 2107B, and 2109A . . . 2109B. However, each inverted index 2107A . . . 2107B, and 2109A . . . 2109B can include less or more information than depicted.

Token entries, such as token entries 2111 illustrated in inverted index 2107B, can include a token 2111A (e.g., “error,” “itemID,” etc.) and event references 2111B indicative of events that include the token. For example, for the token “error,” the corresponding token entry includes the token “error” and an event reference, or unique identifier, for each event stored in the corresponding time-series bucket that includes the token “error.” In the illustrated embodiment of FIG. 21B, the error token entry includes the identifiers 3, 5, 6, 8, 11, and 12 corresponding to events located in the time-series bucket associated with the inverted index 2107B that is stored in common storage 216, a search node 506, or an indexing node 404 and is associated with the partition _main 2103.

In some cases, some token entries can be default entries, automatically determined entries, or user specified entries. In some embodiments, the indexing system 212 can identify each word or string in an event as a distinct token and generate a token entry for the identified word or string. In some cases, the indexing system 212 can identify the beginning and ending of tokens based on punctuation, spaces, etc., as described in greater detail herein. In certain cases, the indexing system 212 can rely on user input or a configuration file to identify tokens for token entries 2111, etc. It will be understood that any combination of token entries can be included as a default, automatically determined, or included based on user-specified criteria.

Similarly, field-value pair entries, such as field-value pair entries 2113 shown in inverted index 2107B, can include a field-value pair 2113A and event references 2113B indicative of events that include a field value that corresponds to the field-value pair. For example, for a field-value pair sourcetype::sendmail, a field-value pair entry can include the field-value pair sourcetype::sendmail and a unique identifier, or event reference, for each event stored in the corresponding time-series bucket that includes a sendmail sourcetype.

In some cases, the field-value pair entries 2113 can be default entries, automatically determined entries, or user specified entries. As a non-limiting example, the field-value pair entries for the fields host, source, and sourcetype can be included in the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B as a default. As such, all of the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B can include field-value pair entries for the fields host, source, and sourcetype. As yet another non-limiting example, the field-value pair entries for the IP_address field can be user specified and may only appear in the inverted index 2107B based on user-specified criteria. As another non-limiting example, as the indexing system 212 indexes the events, it can automatically identify field-value pairs and create field-value pair entries. For example, based on its review of events, the indexing system 212 can identify IP_address as a field in each event and add the IP_address field-value pair entries to the inverted index 2107B. It will be understood that any combination of field-value pair entries can be included as a default, automatically determined, or included based on user-specified criteria.

Each unique identifier 2117, or event reference, can correspond to a unique event located in the time series bucket. However, the same event reference can be located in multiple entries. For example, if an event has a sourcetype splunkd, host www1, and token “warning,” then the unique identifier for the event will appear in the field-value pair entries sourcetype::splunkd and host::www1, as well as the token entry “warning.” With reference to the illustrated embodiment of FIG. 21B and the event that corresponds to the event reference 3, the event reference 3 is found in the field-value pair entries 2113 host::hostA, source::sourceB, sourcetype::sourcetypeA, and IP_address::91.205.189.15, indicating that the event corresponding to the event reference is from hostA, sourceB, of sourcetypeA, and includes 91.205.189.15 in the event data.

For some fields, the unique identifier is located in only one field-value pair entry for a particular field. For example, the inverted index may include four sourcetype field-value pair entries corresponding to four different sourcetypes of the events stored in a bucket (e.g., sourcetypes: sendmail, splunkd, web_access, and web_service). Within those four sourcetype field-value pair entries, an identifier for a particular event may appear in only one of the field-value pair entries. With continued reference to the example illustrated embodiment of FIG. 21B, since the event reference 7 appears in the field-value pair entry sourcetype::sourcetypeA, then it does not appear in the other field-value pair entries for the sourcetype field, including sourcetype::sourcetypeB, sourcetype::sourcetypeC, and sourcetype::sourcetypeD.

The event references 2117 can be used to locate the events in the corresponding bucket. For example, the inverted index can include, or be associated with, an event reference array 2115. The event reference array 2115 can include an array entry 2117 for each event reference in the inverted index 2107B. Each array entry 2117 can include location information 2119 of the event corresponding to the unique identifier (non-limiting example: seek address of the event), a timestamp 2121 associated with the event, or additional information regarding the event associated with the event reference, etc.

For each token entry 2111 or field-value pair entry 2113, the event references 2111B, 2113B or unique identifiers can be listed in chronological order or the value of the event reference can be assigned based on chronological data, such as a timestamp associated with the event referenced by the event reference. For example, the event reference 1 in the illustrated embodiment of FIG. 21B can correspond to the first-in-time event for the bucket, and the event reference 12 can correspond to the last-in-time event for the bucket. However, the event references can be listed in any order, such as reverse chronological order, ascending order, descending order, or some other order, etc. Further, the entries can be sorted. For example, the entries can be sorted alphabetically (collectively or within a particular group), by entry origin (e.g., default, automatically generated, user-specified, etc.), by entry type (e.g., field-value pair entry, token entry, etc.), or chronologically by when added to the inverted index, etc. In the illustrated embodiment of FIG. 21B, the entries are sorted first by entry type and then alphabetically.
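The layout described above (token entries, field-value pair entries, and an event reference array) can be pictured with a minimal Python sketch. The class and attribute names (`InvertedIndex`, `ArrayEntry`, `seek_address`) are illustrative assumptions rather than names used by the described system, and the seek addresses and timestamps are made-up values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ArrayEntry:
    """One entry of the event reference array (hypothetical layout)."""
    seek_address: int  # location information for the event in the bucket
    timestamp: float   # timestamp associated with the event


@dataclass
class InvertedIndex:
    partition: str                   # partition identifier
    time_range: Tuple[float, float]  # time range covered by the index
    token_entries: Dict[str, List[int]] = field(default_factory=dict)
    field_value_entries: Dict[str, List[int]] = field(default_factory=dict)
    event_reference_array: Dict[int, ArrayEntry] = field(default_factory=dict)

    def add_event(self, ref, seek_address, timestamp, tokens, field_values):
        """Record one event under every token and field-value pair it contains."""
        self.event_reference_array[ref] = ArrayEntry(seek_address, timestamp)
        for token in tokens:
            self.token_entries.setdefault(token, []).append(ref)
        for name, value in field_values.items():
            key = f"{name}::{value}"
            self.field_value_entries.setdefault(key, []).append(ref)


index = InvertedIndex(partition="_main", time_range=(100.0, 200.0))
index.add_event(1, 0, 100.0, ["warning"], {"host": "www1", "sourcetype": "splunkd"})
index.add_event(2, 512, 150.0, ["error"], {"host": "www1", "sourcetype": "sendmail"})
```

Because references are appended as events arrive in time order, each entry's list stays in chronological order, matching one of the orderings mentioned above.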

As a non-limiting example of how the inverted indexes 2107A . . . 2107B, and 2109A . . . 2109B can be used during a data categorization request command, the query system 214 can receive filter criteria indicating data that is to be categorized and categorization criteria indicating how the data is to be categorized. Example filter criteria can include, but is not limited to, indexes (or partitions), hosts, sources, sourcetypes, time ranges, field identifier, tenant and/or user identifiers, keywords, etc.

Using the filter criteria, the query system 214 identifies relevant inverted indexes to be searched. For example, if the filter criteria includes a set of partitions (also referred to as indexes), the query system 214 can identify the inverted indexes stored in the directory corresponding to the particular partition as relevant inverted indexes. Other means can be used to identify inverted indexes associated with a partition of interest. For example, in some embodiments, the query system 214 can review an entry in the inverted indexes, such as a partition-value pair entry 2113, to determine if a particular inverted index is relevant. If the filter criteria does not identify any partition, then the query system 214 can identify all inverted indexes managed by the query system 214 as relevant inverted indexes.

Similarly, if the filter criteria includes a time range, the query system 214 can identify inverted indexes corresponding to buckets that satisfy at least a portion of the time range as relevant inverted indexes. For example, if the time range is the last hour, then the query system 214 can identify all inverted indexes that correspond to buckets storing events associated with timestamps within the last hour as relevant inverted indexes.

When used in combination, an index filter criterion specifying one or more partitions and a time range filter criterion specifying a particular time range can be used to identify a subset of inverted indexes within a particular directory (or otherwise associated with a particular partition) as relevant inverted indexes. As such, the query system 214 can focus the processing to only a subset of the total number of inverted indexes in the data intake and query system 108.
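As a rough illustration of combining a partition filter with a time range filter, the following Python sketch selects relevant inverted indexes from a small catalog. The `IndexInfo` record, the `relevant_indexes` function, and the numeric time ranges are hypothetical; the index names merely echo the reference numerals of FIG. 21B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    name: str
    partition: str
    earliest: float  # start of the time range covered by the index
    latest: float    # end of the time range covered by the index


def relevant_indexes(indexes, partitions=None, time_range=None):
    """Keep indexes in the requested partitions that overlap the time range.

    A criterion left as None is treated as "not specified", so every
    index passes that check.
    """
    selected = []
    for idx in indexes:
        if partitions is not None and idx.partition not in partitions:
            continue
        if time_range is not None:
            start, end = time_range
            # Skip indexes whose range lies entirely outside the filter range.
            if idx.latest < start or idx.earliest > end:
                continue
        selected.append(idx)
    return selected


catalog = [
    IndexInfo("2107A", "_main", 0, 99),
    IndexInfo("2107B", "_main", 100, 199),
    IndexInfo("2109A", "_test", 0, 99),
    IndexInfo("2109B", "_test", 100, 199),
]
hits = relevant_indexes(catalog, partitions={"_main"}, time_range=(120, 180))
print([idx.name for idx in hits])  # ['2107B']
```

With no criteria at all, every index in the catalog is returned, which corresponds to the case in which the filter criteria do not identify any partition.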

Once the relevant inverted indexes are identified, the query system 214 can review them using any additional filter criteria to identify events that satisfy the filter criteria. In some cases, using the known location of the directory in which the relevant inverted indexes are located, the query system 214 can determine that any events identified using the relevant inverted indexes satisfy an index filter criterion. For example, if the filter criteria includes a partition main, then the query system 214 can determine that any events identified using inverted indexes within the partition main directory (or otherwise associated with the partition main) satisfy the index filter criterion.

Furthermore, based on the time range associated with each inverted index, the query system 214 can determine that any events identified using a particular inverted index satisfy a time range filter criterion. For example, if a time range filter criterion is for the last hour and a particular inverted index corresponds to events within a time range of 50 minutes ago to 35 minutes ago, the query system 214 can determine that any events identified using the particular inverted index satisfy the time range filter criterion. Conversely, if the particular inverted index corresponds to events within a time range of 59 minutes ago to 62 minutes ago, the query system 214 can determine that some events identified using the particular inverted index may not satisfy the time range filter criterion.

Using the inverted indexes, the query system 214 can identify event references (and therefore events) that satisfy the filter criteria. For example, if the token “error” is a filter criterion, the query system 214 can track all event references within the token entry “error.” Similarly, the query system 214 can identify other event references located in other token entries or field-value pair entries that match the filter criteria. The system can identify event references located in all of the entries identified by the filter criteria. For example, if the filter criteria include the token “error” and field-value pair sourcetype::web_ui, the query system 214 can track the event references found in both the token entry “error” and the field-value pair entry sourcetype::web_ui. As mentioned previously, in some cases, such as when multiple values are identified for a particular filter criterion (e.g., multiple sources for a source filter criterion), the system can identify event references located in at least one of the entries corresponding to the multiple values and in all other entries identified by the filter criteria. The query system 214 can determine that the events associated with the identified event references satisfy the filter criteria.
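The matching rule described above amounts to a union across the alternative values of a single criterion followed by an intersection across criteria. A minimal Python sketch follows; the function name `matching_references` and the entry contents are hypothetical and chosen only for illustration.

```python
def matching_references(entries, criteria):
    """Return event references that satisfy every criterion.

    `entries` maps an entry key (token or field::value) to event references.
    `criteria` is a list of criteria; each criterion is a list of alternative
    entry keys, any one of which satisfies that criterion.
    """
    result = None
    for alternatives in criteria:
        refs = set()
        for key in alternatives:
            refs |= set(entries.get(key, ()))  # union within one criterion
        # Intersection across criteria.
        result = refs if result is None else result & refs
    # With no criteria at all, this sketch returns an empty set.
    return result if result is not None else set()


entries = {
    "error": [1, 2, 4, 6],
    "sourcetype::web_ui": [2, 3, 4, 5],
    "source::s1": [1, 2],
    "source::s2": [4, 5],
    "source::s3": [6],
}

both = matching_references(entries, [["error"], ["sourcetype::web_ui"]])
multi = matching_references(entries, [["error"], ["source::s1", "source::s3"]])
print(sorted(both))   # [2, 4]
print(sorted(multi))  # [1, 2, 6]
```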

In some cases, the query system 214 can further consult a timestamp associated with the event reference to determine whether an event satisfies the filter criteria. For example, if an inverted index corresponds to a time range that is partially outside of a time range filter criterion, then the query system 214 can consult a timestamp associated with the event reference to determine whether the corresponding event satisfies the time range criterion. In some embodiments, to identify events that satisfy a time range, the query system 214 can review an array, such as the event reference array 2115, that identifies the time associated with the events. Furthermore, as mentioned above, using the known location of the directory in which the relevant inverted indexes are located (or other partition identifier), the query system 214 can determine that any events identified using the relevant inverted indexes satisfy the index filter criterion.
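The distinction between an index whose time range lies wholly inside the filter range and one that only partially overlaps it can be sketched as follows. This is an illustrative Python sketch under the assumption that times are expressed as minutes relative to now (negative values are in the past); the function names and the per-event timestamps are hypothetical.

```python
def classify(index_range, filter_range):
    """Return 'none', 'all', or 'partial' overlap of an index with a filter."""
    i_start, i_end = index_range
    f_start, f_end = filter_range
    if i_end < f_start or i_start > f_end:
        return "none"
    if f_start <= i_start and i_end <= f_end:
        return "all"
    return "partial"


def refs_in_time_range(refs, timestamps, index_range, filter_range):
    """Consult per-event timestamps only when the index partially overlaps."""
    overlap = classify(index_range, filter_range)
    if overlap == "none":
        return []
    if overlap == "all":
        return list(refs)  # every event in the index qualifies
    f_start, f_end = filter_range
    return [r for r in refs if f_start <= timestamps[r] <= f_end]


last_hour = (-60, 0)
print(classify((-50, -35), last_hour))  # all
print(classify((-62, -59), last_hour))  # partial

# Hypothetical timestamps, standing in for the event reference array.
timestamps = {1: -61.5, 2: -59.5}
print(refs_in_time_range([1, 2], timestamps, (-62, -59), last_hour))  # [2]
```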

In some cases, based on the filter criteria, the query system 214 reviews an extraction rule. In certain embodiments, if the filter criteria includes a field name that does not correspond to a field-value pair entry in an inverted index, the query system 214 can review an extraction rule, which may be located in a configuration file, to identify a field that corresponds to a field-value pair entry in the inverted index.

For example, if the filter criteria includes a field name “sessionID” and the query system 214 determines that at least one relevant inverted index does not include a field-value pair entry corresponding to the field name sessionID, the query system 214 can review an extraction rule that identifies how the sessionID field is to be extracted from a particular host, source, or sourcetype (implicitly identifying the particular host, source, or sourcetype that includes a sessionID field). The query system 214 can replace the field name “sessionID” in the filter criteria with the identified host, source, or sourcetype. In some cases, the field name “sessionID” may be associated with multiple hosts, sources, or sourcetypes, in which case all identified hosts, sources, and sourcetypes can be added as filter criteria. In some cases, the identified host, source, or sourcetype can replace or be appended to a filter criterion, or be excluded. For example, if the filter criteria includes a criterion for source S1 and the “sessionID” field is found in source S2, the source S2 can replace S1 in the filter criteria, be appended such that the filter criteria includes source S1 and source S2, or be excluded based on the presence of the filter criterion source S1. If the identified host, source, or sourcetype is included in the filter criteria, the query system 214 can then identify a field-value pair entry in the inverted index that includes a field value corresponding to the identity of the particular host, source, or sourcetype identified using the extraction rule.

Once the events that satisfy the filter criteria are identified, the query system 214 can categorize the results based on the categorization criteria. The categorization criteria can include categories for grouping the results, such as any combination of partition, source, sourcetype, or host, or other categories or fields as desired.

The query system 214 can use the categorization criteria to identify categorization criteria-value pairs or categorization criteria values by which to categorize or group the results. The categorization criteria-value pairs can correspond to one or more field-value pair entries stored in a relevant inverted index, one or more partition-value pairs based on a directory in which the inverted index is located or an entry in the inverted index (or other means by which an inverted index can be associated with a partition), or other criteria-value pair that identifies a general category and a particular value for that category. The categorization criteria values can correspond to the value portion of the categorization criteria-value pair.

As mentioned, in some cases, the categorization criteria-value pairs can correspond to one or more field-value pair entries stored in the relevant inverted indexes. For example, the categorization criteria-value pairs can correspond to field-value pair entries of host, source, and sourcetype (or other field-value pair entry as desired). For instance, if there are ten different hosts, four different sources, and five different sourcetypes for an inverted index, then the inverted index can include ten host field-value pair entries, four source field-value pair entries, and five sourcetype field-value pair entries. The query system 214 can use the nineteen distinct field-value pair entries as categorization criteria-value pairs to group the results.

Specifically, the query system 214 can identify the location of the event references associated with the events that satisfy the filter criteria within the field-value pairs, and group the event references based on their location. As such, the query system 214 can identify the particular field value associated with the event corresponding to the event reference. For example, if the categorization criteria include host and sourcetype, the host field-value pair entries and sourcetype field-value pair entries can be used as categorization criteria-value pairs to identify the specific host and sourcetype associated with the events that satisfy the filter criteria.

In addition, as mentioned, categorization criteria-value pairs can correspond to data other than the field-value pair entries in the relevant inverted indexes. For example, if partition or index is used as a categorization criterion, the inverted indexes may not include partition field-value pair entries. Rather, the query system 214 can identify the categorization criteria-value pair associated with the partition based on the directory in which an inverted index is located, information in the inverted index, or other information that associates the inverted index with the partition, etc. As such, a variety of methods can be used to identify the categorization criteria-value pairs from the categorization criteria.

Accordingly, based on the categorization criteria (and categorization criteria-value pairs), the query system 214 can generate groupings based on the events that satisfy the filter criteria. As a non-limiting example, if the categorization criteria includes a partition and sourcetype, then the groupings can correspond to events that are associated with each unique combination of partition and sourcetype. For instance, if there are three different partitions and two different sourcetypes associated with the identified events, then six different groups can be formed, each with a unique partition value-sourcetype value combination. Similarly, if the categorization criteria includes partition, sourcetype, and host and there are two different partitions, three sourcetypes, and five hosts associated with the identified events, then the query system 214 can generate up to thirty groups for the results that satisfy the filter criteria. Each group can be associated with a unique combination of categorization criteria-value pairs (e.g., unique combinations of partition value, sourcetype value, and host value).

In addition, the query system 214 can count the number of events associated with each group based on the number of events that meet the unique combination of categorization criteria for a particular group (or match the categorization criteria-value pairs for the particular group). With continued reference to the example above, the query system 214 can count the number of events that meet the unique combination of partition, sourcetype, and host for a particular group.
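Grouping and counting by categorization criteria can be sketched directly on top of field-value pair entries, without reading the events themselves. The Python sketch below is illustrative only: the function `group_counts` and the entry contents are hypothetical, and it assumes entry keys of the form `field::value`.

```python
from collections import Counter


def group_counts(field_value_entries, matching_refs, categories):
    """Count matching events per unique combination of category values."""
    # Invert the relevant entries: event reference -> {field: value}.
    values = {ref: {} for ref in matching_refs}
    for key, refs in field_value_entries.items():
        name, value = key.split("::", 1)
        if name not in categories:
            continue  # entry is not one of the categorization criteria
        for ref in refs:
            if ref in values:
                values[ref][name] = value
    # One group per distinct tuple of category values.
    return Counter(
        tuple(values[ref].get(category) for category in categories)
        for ref in matching_refs
    )


entries = {
    "host::h1": [1, 2, 3],
    "host::h2": [4],
    "sourcetype::a": [1, 4],
    "sourcetype::b": [2, 3],
    "source::x": [1, 2, 3, 4],
}
counts = group_counts(entries, [1, 2, 3, 4], ["host", "sourcetype"])
print(counts[("h1", "b")])  # 2
```

Only combinations that actually occur produce a group, which is why the number of groups can be smaller than the product of the number of distinct values for each category.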

The query system 214, such as the search head 504, can aggregate the groupings from the buckets, or search nodes 506, and provide the groupings for display. In some cases, the groups are displayed based on at least one of the host, source, sourcetype, or partition associated with the groupings. In some embodiments, the query system 214 can further display the groups based on display criteria, such as a display order or a sort order as described in greater detail above.

As a non-limiting example and with reference to FIG. 21B, consider a request received by the query system 214 that includes the following filter criteria: keyword=error, partition=_main, time range=3/1/17 16:22.00.000-16:28.00.000, sourcetype=sourcetypeC, host=hostB, and the following categorization criteria: source.

Based on the above criteria, a search node 506 of the query system 214 that is associated with the data store 2101 identifies _main directory 2103 and can ignore _test directory 2105 and any other partition-specific directories. The search node 506 determines that inverted index 2107B is a relevant index based on its location within the _main directory 2103 and the time range associated with it. For sake of simplicity in this example, the search node 506 determines that no other inverted indexes in the _main directory 2103, such as inverted index 2107A, satisfy the time range criterion.

Having identified the relevant inverted index 2107B, the search node 506 reviews the token entries 2111 and the field-value pair entries 2113 to identify event references, or events, that satisfy all of the filter criteria.

With respect to the token entries 2111, the search node 506 can review the error token entry and identify event references 3, 5, 6, 8, 11, 12, indicating that the term “error” is found in the corresponding events. Similarly, the search node 506 can identify event references 4, 5, 6, 8, 9, 10, 11 in the field-value pair entry sourcetype::sourcetypeC and event references 2, 5, 6, 8, 10, 11 in the field-value pair entry host::hostB. As the filter criteria did not include a source or an IP_address field-value pair, the search node 506 can ignore those field-value pair entries.

In addition to identifying event references found in at least one token entry or field-value pair entry (e.g., event references 2, 3, 4, 5, 6, 8, 9, 10, 11, 12), the search node 506 can identify events (and corresponding event references) that satisfy the time range criterion using the event reference array 2115 (e.g., event references 2, 3, 4, 5, 6, 7, 8, 9, 10). Using the information obtained from the inverted index 2107B (including the event reference array 2115), the search node 506 can identify the event references that satisfy all of the filter criteria (e.g., event references 5, 6, 8).

Having identified the events (and event references) that satisfy all of the filter criteria, the search node 506 can group the event references using the received categorization criteria (source). In doing so, the search node 506 can determine that event references 5 and 6 are located in the field-value pair entry source::sourceD (or have matching categorization criteria-value pairs) and event reference 8 is located in the field-value pair entry source::sourceC. Accordingly, the search node 506 can generate a sourceC group having a count of one corresponding to reference 8 and a sourceD group having a count of two corresponding to references 5 and 6. This information can be communicated to the search head 504. In turn, the search head 504 can aggregate the results from the various search nodes 506 and display the groupings. As mentioned above, in some embodiments, the groupings can be displayed based at least in part on the categorization criteria, including at least one of host, source, sourcetype, or partition.
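The worked example above can be reproduced with plain set operations. In the Python sketch below, the event references for the error token, sourcetype::sourcetypeC, host::hostB, and the time range are the values stated in the text; the full contents of the source entries are taken from the group listing given later in this section. The variable names are illustrative.

```python
# Event references stated for FIG. 21B in the example above.
error_refs = {3, 5, 6, 8, 11, 12}             # token entry "error"
sourcetype_c_refs = {4, 5, 6, 8, 9, 10, 11}   # sourcetype::sourcetypeC
host_b_refs = {2, 5, 6, 8, 10, 11}            # host::hostB
in_time_range = {2, 3, 4, 5, 6, 7, 8, 9, 10}  # from the event reference array

# Source entries, used only for categorization.
source_entries = {
    "sourceA": {1, 4, 7, 12},
    "sourceB": {3, 9},
    "sourceC": {2, 8, 11},
    "sourceD": {5, 6, 10},
}

# An event must satisfy every filter criterion, so intersect the sets.
matches = error_refs & sourcetype_c_refs & host_b_refs & in_time_range

# Group the matching references by source and keep only non-empty groups.
groups = {
    source: len(refs & matches)
    for source, refs in source_entries.items()
    if refs & matches
}

print(sorted(matches))  # [5, 6, 8]
print(groups)           # {'sourceC': 1, 'sourceD': 2}
```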

It will be understood that a change to any of the filter criteria or categorization criteria can result in different groupings. As one non-limiting example, a request received by a search node 506 that includes the following filter criteria: partition=_main, time range=3/1/17 3/1/17 16:21:20.000-16:28:17.000, and the following categorization criteria: host, source, sourcetype, can result in the search node 506 identifying event references 1-12 as satisfying the filter criteria. The search node 506 can generate up to 24 groupings corresponding to the 24 different combinations of the categorization criteria-value pairs, including host (hostA, hostB), source (sourceA, sourceB, sourceC, sourceD), and sourcetype (sourcetypeA, sourcetypeB, sourcetypeC). However, as there are only twelve event identifiers in the illustrated embodiment and some fall into the same grouping, the search node 506 generates eight groups and counts as follows:

-   Group 1 (hostA, sourceA, sourcetypeA): 1 (event reference 7)
-   Group 2 (hostA, sourceA, sourcetypeB): 2 (event references 1, 12)
-   Group 3 (hostA, sourceA, sourcetypeC): 1 (event reference 4)
-   Group 4 (hostA, sourceB, sourcetypeA): 1 (event reference 3)
-   Group 5 (hostA, sourceB, sourcetypeC): 1 (event reference 9)
-   Group 6 (hostB, sourceC, sourcetypeA): 1 (event reference 2)
-   Group 7 (hostB, sourceC, sourcetypeC): 2 (event references 8, 11)
-   Group 8 (hostB, sourceD, sourcetypeC): 3 (event references 5, 6, 10)

As noted, each group has a unique combination of categorization criteria-value pairs or categorization criteria values. The search node 506 communicates the groups to the search head 504 for aggregation with results received from other search nodes 506. In communicating the groups to the search head 504, the search node 506 can include the categorization criteria-value pairs for each group and the count. In some embodiments, the search node 506 can include more or less information. For example, the search node 506 can include the event references associated with each group and other identifying information, such as the search node 506 or inverted index used to identify the groups.
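Aggregation at the search head can be as simple as summing per-group counts keyed by the categorization criteria values. The following Python sketch assumes each search node reports a mapping from a (host, source, sourcetype) tuple to a count; the function `aggregate` is hypothetical, the first node's figures echo two of the groups listed above, and the second node's figures are invented for illustration.

```python
from collections import Counter


def aggregate(partial_groupings):
    """Sum group counts reported by several search nodes."""
    total = Counter()
    for grouping in partial_groupings:
        total.update(grouping)  # adds counts for keys that already exist
    return total


node_1 = {
    ("hostA", "sourceA", "sourcetypeA"): 1,
    ("hostB", "sourceD", "sourcetypeC"): 3,
}
node_2 = {  # hypothetical results from another search node
    ("hostB", "sourceD", "sourcetypeC"): 2,
    ("hostC", "sourceE", "sourcetypeA"): 4,
}

merged = aggregate([node_1, node_2])
print(merged[("hostB", "sourceD", "sourcetypeC")])  # 5
```

Reporting only the criteria values and a count keeps the payload small; sending event references as well, as the text notes, is an option when the search head needs to drill into a group later.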

As another non-limiting example, a request received by a search node 506 that includes the following filter criteria: partition=_main, time range=3/1/17 3/1/17 16:21:20.000-16:28:17.000, source=sourceA, sourceD, and keyword=itemID, and the following categorization criteria: host, source, sourcetype, can result in the search node 506 identifying event references 4, 7, and 10 as satisfying the filter criteria and generating the following groups:

-   Group 1 (hostA, sourceA, sourcetypeC): 1 (event reference 4)
-   Group 2 (hostA, sourceA, sourcetypeA): 1 (event reference 7)
-   Group 3 (hostB, sourceD, sourcetypeC): 1 (event reference 10)

The search node 506 communicates the groups to the search head 504 for aggregation with results received from other search nodes 506. As will be understood, there are myriad ways for filtering and categorizing the events and event references. For example, the search node 506 can review multiple inverted indexes associated with a partition or review the inverted indexes of multiple partitions, and categorize the data using any one or any combination of partition, host, source, sourcetype, or other category, as desired.

Further, if a user interacts with a particular group, the search node 506 can provide additional information regarding the group. For example, the search node 506 can perform a targeted search or sampling of the events that satisfy the filter criteria and the categorization criteria for the selected group, also referred to as the filter criteria corresponding to the group or filter criteria associated with the group.

In some cases, to provide the additional information, the search node 506 relies on the inverted index. For example, the search node 506 can identify the event references associated with the events that satisfy the filter criteria and the categorization criteria for the selected group and then use the event reference array 2115 to access some or all of the identified events. In some cases, the categorization criteria values or categorization criteria-value pairs associated with the group become part of the filter criteria for the review.

With reference to FIG. 21B, for instance, suppose a group is displayed with a count of six corresponding to event references 4, 5, 6, 8, 10, 11 (i.e., event references 4, 5, 6, 8, 10, 11 satisfy the filter criteria and are associated with matching categorization criteria values or categorization criteria-value pairs) and a user interacts with the group (e.g., selecting the group, clicking on the group, etc.). In response, the search head 504 communicates with the search node 506 to provide additional information regarding the group.

In some embodiments, the search node 506 identifies the event references associated with the group using the filter criteria and the categorization criteria for the group (e.g., categorization criteria values or categorization criteria-value pairs unique to the group). Together, the filter criteria and the categorization criteria for the group can be referred to as the filter criteria associated with the group. Using the filter criteria associated with the group, the search node 506 identifies event references 4, 5, 6, 8, 10, 11.

Based on sampling criteria, discussed in greater detail above, the search node 506 can determine that it will analyze a sample of the events associated with the event references 4, 5, 6, 8, 10, 11. For example, the sample can include analyzing event data associated with the event references 5, 8, 10. In some embodiments, the search node 506 can use the event reference array 2115 to access the event data associated with the event references 5, 8, 10. Once accessed, the search node 506 can compile the relevant information and provide it to the search head 504 for aggregation with results from other search nodes. By identifying events and sampling event data using the inverted indexes, the search node can reduce the amount of actual data that is analyzed and the number of events that are accessed in order to generate the summary of the group and provide a response in less time.
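A minimal Python sketch of this sampling step follows. It assumes uniform random sampling, which is only one possible sampling criterion, and models the event reference array as a mapping from event reference to a hypothetical (offset, length) location in an in-memory bucket. The names `sample_events` and `read_event` and the synthetic event payloads are illustrative.

```python
import random


def sample_events(refs, event_reference_array, read_event, sample_size, seed=None):
    """Read only a random sample of the referenced events."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(refs), min(sample_size, len(refs)))
    # Only the sampled events are fetched from the bucket.
    return {ref: read_event(event_reference_array[ref]) for ref in sorted(chosen)}


# Build a toy bucket and its event reference array (hypothetical layout).
refs = [4, 5, 6, 8, 10, 11]
bucket = bytearray()
array = {}
for ref in refs:
    payload = f"event {ref}\n".encode()
    array[ref] = (len(bucket), len(payload))  # (seek address, length)
    bucket += payload


def read_event(location):
    """Fetch one event from the bucket using its location information."""
    offset, length = location
    return bytes(bucket[offset:offset + length]).decode().rstrip("\n")


sample = sample_events(refs, array, read_event, sample_size=3, seed=0)
print(len(sample))  # 3
```

Because the event references are resolved from the inverted index first, the bucket is touched only for the sampled events rather than for all six.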

4.5. Query Processing Flow

FIG. 22A is a flow diagram illustrating an embodiment of a routine implemented by the query system 214 for executing a query. At block 2202, a search head 504 receives a search query. At block 2204, the search head 504 analyzes the search query to determine what portion(s) of the query to delegate to search nodes 506 and what portions of the query to execute locally by the search head 504. At block 2206, the search head distributes the determined portions of the query to the appropriate search nodes 506. In some embodiments, a search head cluster may take the place of an independent search head 504 where each search head 504 in the search head cluster coordinates with peer search heads 504 in the search head cluster to schedule jobs, replicate search results, update configurations, fulfill search requests, etc. In some embodiments, the search head 504 (or each search head) consults with a search node catalog 510 that provides the search head with a list of search nodes 506 to which the search head can distribute the determined portions of the query. A search head 504 may communicate with the search node catalog 510 to discover the addresses of active search nodes 506.

At block 2208, the search nodes 506 to which the query was distributed search data stores associated with them for events that are responsive to the query. To determine which events are responsive to the query, the search node 506 searches for events that match the criteria specified in the query. These criteria can include matching keywords or specific values for certain fields. The searching operations at block 2208 may use the late-binding schema to extract values for specified fields from events at the time the query is processed. In some embodiments, one or more rules for extracting field values may be specified as part of a source type definition in a configuration file. The search nodes 506 may then either send the relevant events back to the search head 504, or use the events to determine a partial result, and send the partial result back to the search head 504.

At block 2210, the search head 504 combines the partial results and/or events received from the search nodes 506 to produce a final result for the query. In some examples, the results of the query are indicative of performance or security of the IT environment and may help improve the performance of components in the IT environment. This final result may comprise different types of data depending on what the query requested. For example, the results can include a listing of matching events returned by the query, or some type of visualization of the data from the returned events. In another example, the final result can include one or more calculated values derived from the matching events.
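
By way of non-limiting illustration, the flow of blocks 2204 through 2210 can be sketched as a simple scatter/gather routine. The sketch below is an assumption-laden toy rather than the disclosed implementation: the in-memory `NODE_STORES` mapping, the function names, and the sample events are all hypothetical, and the delegated portion of the query is reduced to a keyword match followed by a per-host count.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the data stores of two search nodes.
NODE_STORES = {
    "node-a": [
        {"host": "web1", "raw": "ERROR disk full"},
        {"host": "web2", "raw": "INFO started"},
    ],
    "node-b": [
        {"host": "web1", "raw": "ERROR timeout"},
        {"host": "web3", "raw": "ERROR disk full"},
    ],
}


def search_node_partial(events, keyword):
    """Delegated portion: find matching events and count them per host."""
    matches = [e for e in events if keyword in e["raw"]]
    return Counter(e["host"] for e in matches)


def search_head_execute(keyword):
    """Distribute the query to every node, then merge the partial results."""
    final = Counter()
    for events in NODE_STORES.values():
        final += search_node_partial(events, keyword)
    return dict(final)


result = search_head_execute("ERROR")
print(result)
```

In this sketch each node returns a partial result (a count per host) rather than the matching events themselves, which corresponds to the second of the two options described for block 2208.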

The results generated by the system 108 can be returned to a client using different techniques. For example, one technique streams results or relevant events back to a client in real-time as they are identified. Another technique waits to report the results to the client until a complete set of results (which may include a set of relevant events or a result based on relevant events) is ready to return to the client. Yet another technique streams interim results or relevant events back to the client in real-time until a complete set of results is ready, and then returns the complete set of results to the client. In another technique, certain results are stored as “search jobs” and the client may retrieve the results by referring to the search jobs.

The search head 504 can also perform various operations to make the search more efficient. For example, before the search head 504 begins execution of a query, the search head 504 can determine a time range for the query and a set of common keywords that all matching events include. The search head 504 may then use these parameters to query the search nodes 506 to obtain a superset of the eventual results. Then, during a filtering stage, the search head 504 can perform field-extraction operations on the superset to produce a reduced set of search results. This speeds up queries, which may be particularly helpful for queries that are performed on a periodic basis.

4.6. Pipelined Search Language

Various embodiments of the present disclosure can be implemented using, or in conjunction with, a pipelined command language. A pipelined command language is a language in which a set of inputs or data is operated on by a first command in a sequence of commands, and then subsequent commands in the order they are arranged in the sequence. Such commands can include any type of functionality for operating on data, such as retrieving, searching, filtering, aggregating, processing, transmitting, and the like. As described herein, a query can thus be formulated in a pipelined command language and include any number of ordered or unordered commands for operating on data.

Splunk Processing Language (SPL) is an example of a pipelined command language in which a set of inputs or data is operated on by any number of commands in a particular sequence. A sequence of commands, or command sequence, can be formulated such that the order in which the commands are arranged defines the order in which the commands are applied to a set of data or the results of an earlier executed command. For example, a first command in a command sequence can operate to search or filter for specific data in a particular set of data. The results of the first command can then be passed to another command listed later in the command sequence for further processing.

In various embodiments, a query can be formulated as a command sequence defined in a command line of a search UI. In some embodiments, a query can be formulated as a sequence of SPL commands. Some or all of the SPL commands in the sequence of SPL commands can be separated from one another by a pipe symbol “|”. In such embodiments, a set of data, such as a set of events, can be operated on by a first SPL command in the sequence, and then a subsequent SPL command following a pipe symbol “|” after the first SPL command operates on the results produced by the first SPL command or other set of data, and so on for any additional SPL commands in the sequence. As such, a query formulated using SPL comprises a series of consecutive commands that are delimited by pipe “|” characters. The pipe character indicates to the system that the output or result of one command (to the left of the pipe) should be used as the input for one of the subsequent commands (to the right of the pipe). This enables formulation of queries defined by a pipeline of sequenced commands that refines or enhances the data at each step along the pipeline until the desired results are attained. Accordingly, various embodiments described herein can be implemented with Splunk Processing Language (SPL) used in conjunction with the SPLUNK® ENTERPRISE system.
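
As one illustrative sketch of the pipe semantics described above, the routine below splits a query string on the pipe character and feeds the output of each command to the next. It is not an SPL interpreter: the `cmd_search` and `cmd_head` functions are toy stand-ins written for this example, the `COMMANDS` table and `run_pipeline` are hypothetical names, and the naive split does not handle a pipe character appearing inside a quoted string.

```python
def cmd_search(rows, args):
    """Toy 'search' command: keep rows whose text contains the given term."""
    return [r for r in rows if args in r]


def cmd_head(rows, args):
    """Toy 'head' command: keep only the first N rows."""
    return rows[: int(args)]


# Hypothetical command table mapping a command name to its implementation.
COMMANDS = {"search": cmd_search, "head": cmd_head}


def run_pipeline(query, rows):
    """Apply each pipe-delimited command, left to right, to the prior output."""
    results = rows
    for stage in query.split("|"):
        name, _, args = stage.strip().partition(" ")
        results = COMMANDS[name](results, args.strip())
    return results


events = ["ERROR disk full", "INFO started", "ERROR timeout", "ERROR retry"]
output = run_pipeline("search ERROR | head 2", events)
print(output)
```

The point of the design is that every command has the same shape (rows in, rows out), so the order of commands in the string is the only thing that determines the order of processing.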

While a query can be formulated in many ways, a query can start with a search command and one or more corresponding search terms at the beginning of the pipeline. Such search terms can include any combination of keywords, phrases, times, dates, Boolean expressions, fieldname-field value pairs, etc. that specify which results should be obtained from an index. The results can then be passed as inputs into subsequent commands in a sequence of commands by using, for example, a pipe character. The subsequent commands in a sequence can include directives for additional processing of the results once they have been obtained from one or more indexes. For example, commands may be used to filter unwanted information out of the results, extract more information, evaluate field values, calculate statistics, reorder the results, create an alert, create a summary of the results, or perform some type of aggregation function. In some embodiments, the summary can include a graph, chart, metric, or other visualization of the data. An aggregation function can include analysis or calculations to return an aggregate value, such as an average value, a sum, a maximum value, a root mean square, statistical values, and the like.

Due to its flexible nature, use of a pipelined command language in various embodiments is advantageous because it can perform “filtering” as well as “processing” functions. In other words, a single query can include a search command and search term expressions, as well as data-analysis expressions. For example, a command at the beginning of a query can perform a “filtering” step by retrieving a set of data based on a condition (e.g., records associated with server response times of less than 1 microsecond). The results of the filtering step can then be passed to a subsequent command in the pipeline that performs a “processing” step (e.g., calculation of an aggregate value related to the filtered events such as the average response time of servers with response times of less than 1 microsecond). Furthermore, the search command can allow events to be filtered by keyword as well as field value criteria. For example, a search command can filter out all events containing the word “warning” or filter out all events where a field value associated with a field “clientip” is “10.0.1.2.”

The results obtained or generated in response to a command in a query can be considered a set of results data. The set of results data can be passed from one command to another in any data format. In one embodiment, the set of result data can be in the form of a dynamically created table. Each command in a particular query can redefine the shape of the table. In some implementations, an event retrieved from an index in response to a query can be considered a row with a column for each field value. Columns contain basic information about the data and also may contain data that has been dynamically extracted at search time.

FIG. 22B provides a visual representation of the manner in which a pipelined command language or query operates in accordance with the disclosed embodiments. The query 2230 can be inputted by the user into a search. The query comprises a search, the results of which are piped to two commands (namely, command 1 and command 2) that follow the search step.

Disk 2222 represents the event data in the raw record data store.

When a user query is processed, a search step will precede other queries in the pipeline in order to generate a set of events at block 2240. For example, the query can comprise search terms “sourcetype=syslog ERROR” at the front of the pipeline as shown in FIG. 22B. Intermediate results table 2224 shows fewer rows because it represents the subset of events retrieved from the index that matched the search terms “sourcetype=syslog ERROR” from search command 2230. By way of further example, instead of a search step, the set of events at the head of the pipeline may be generated by a call to a pre-existing inverted index (as will be explained later).

At block 2242, the set of events generated in the first part of the query may be piped to a query that searches the set of events for field-value pairs or for keywords. For example, the second intermediate results table 2226 shows fewer columns, representing the result of the top command, “top user” which summarizes the events into a list of the top 10 users and displays the user, count, and percentage.

Finally, at block 2244, the results of the prior stage can be pipelined to another stage where further filtering or processing of the data can be performed, e.g., preparing the data for display purposes, filtering the data based on a condition, performing a mathematical calculation with the data, etc. As shown in FIG. 22B, the “fields - percent” part of command 2230 removes the column that shows the percentage, thereby leaving a final results table 2228 without a percentage column. In different embodiments, other query languages, such as the Structured Query Language (“SQL”), can be used to create a query.
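
The three stages of FIG. 22B (search, “top user”, and removal of the percentage column) can be approximated as three table transformations. The following is a minimal sketch under stated assumptions: the sample events, the `search`, `top`, and `drop_fields` helper names, and the `_raw` key are all invented for illustration and do not reproduce the behavior of any actual SPL command in full.

```python
from collections import Counter

# Made-up events; each dict is one row of the dynamically created table.
events = [
    {"sourcetype": "syslog", "user": "alice", "_raw": "ERROR login failed"},
    {"sourcetype": "syslog", "user": "bob", "_raw": "ERROR disk full"},
    {"sourcetype": "syslog", "user": "alice", "_raw": "ERROR timeout"},
    {"sourcetype": "syslog", "user": "carol", "_raw": "INFO ok"},
    {"sourcetype": "access_combined", "user": "alice", "_raw": "GET / ERROR"},
]


def search(rows, sourcetype, keyword):
    """Stage 1: fewer rows (only events matching the search terms)."""
    return [r for r in rows if r["sourcetype"] == sourcetype and keyword in r["_raw"]]


def top(rows, field, limit=10):
    """Stage 2: fewer columns (value, count, and percentage per top value)."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return [
        {field: value, "count": n, "percent": 100.0 * n / total}
        for value, n in counts.most_common(limit)
    ]


def drop_fields(rows, *names):
    """Stage 3: remove the named columns from every row."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]


stage1 = search(events, "syslog", "ERROR")
stage2 = top(stage1, "user")
final = drop_fields(stage2, "percent")
print(final)
```

Each stage reshapes the table it receives, mirroring how intermediate results tables 2224 and 2226 differ in rows and columns from the final results table 2228.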

4.7. Field Extraction

The query system 214 allows users to search and visualize events generated from machine data received from homogenous data sources. The query system 214 also allows users to search and visualize events generated from machine data received from heterogeneous data sources. The query system 214 includes various components for processing a query, such as, but not limited to a query system manager 502, one or more search heads 504 having one or more search masters 512 and search managers 514, and one or more search nodes 506. A query language may be used to create a query, such as any suitable pipelined query language. For example, Splunk Processing Language (SPL) can be utilized to make a query. SPL is a pipelined search language in which a set of inputs is operated on by a first command in a command line, and then a subsequent command following the pipe symbol “|” operates on the results produced by the first command, and so on for additional commands. Other query languages, such as the Structured Query Language (“SQL”), can be used to create a query.

In response to receiving the search query, a search head 504 (e.g., a search master 512 or search manager 514) can use extraction rules to extract values for fields in the events being searched. The search head 504 can obtain extraction rules that specify how to extract a value for fields from an event. Extraction rules can comprise regex rules that specify how to extract values for the fields corresponding to the extraction rules. In addition to specifying how to extract field values, the extraction rules may also include instructions for deriving a field value by performing a function on a character string or value retrieved by the extraction rule. For example, an extraction rule may truncate a character string or convert the character string into a different data format. In some cases, the query itself can specify one or more extraction rules.

The search head 504 can apply the extraction rules to events that it receives from search nodes 506. The search nodes 506 may apply the extraction rules to events in an associated data store or common storage 216. Extraction rules can be applied to all the events in a data store or common storage 216 or to a subset of the events that have been filtered based on some criteria (e.g., event time stamp values, etc.). Extraction rules can be used to extract one or more values for a field from events by parsing the portions of machine data in the events and examining the data for one or more patterns of characters, numbers, delimiters, etc., that indicate where the field begins and, optionally, ends.
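
An extraction rule of the kind described above can be pictured as a regular expression paired with an optional function that derives the final field value. The sketch below is illustrative only: the `RULES` table, the `extract` helper, the regex patterns, and the sample event string are assumptions made for this example, and the derive step (truncate to eight characters and lowercase) merely stands in for the truncation or format conversion mentioned in the text.

```python
import re

# Hypothetical extraction rules: a regex with one capture group, plus an
# optional function applied to the captured string to derive the field value.
RULES = {
    "clientip": {
        "regex": re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s"),
        "derive": None,
    },
    "user": {
        "regex": re.compile(r"user=(\w+)"),
        "derive": lambda s: s[:8].lower(),  # truncate, then normalize case
    },
}


def extract(event_text, field):
    """Return the value of `field` in the event text, or None if absent."""
    rule = RULES[field]
    match = rule["regex"].search(event_text)
    if match is None:
        return None
    value = match.group(1)
    return rule["derive"](value) if rule["derive"] else value


event = "10.0.1.2 - - [15/Nov/2019:09:33:22] GET /cart user=JohnMedlock42 200"
print(extract(event, "clientip"), extract(event, "user"))
```

Because the rule is applied when the query runs, the same stored event text can yield different fields simply by changing the rule table, without re-ingesting the data.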

FIG. 23A is a diagram of an example scenario where a common customer identifier is found among log data received from three disparate data sources, in accordance with example embodiments. In this example, a user submits an order for merchandise using a vendor's shopping application program 2301 running on the user's system. In this example, the order was not delivered to the vendor's server due to a resource exception at the destination server that is detected by the middleware code 2302. The user then sends a message to the customer support server 2303 to complain about the order failing to complete. The three systems 2301, 2302, and 2303 are disparate systems that do not have a common logging format. The order application 2301 sends log data 2304 to the data intake and query system 108 in one format, the middleware code 2302 sends error log data 2305 in a second format, and the support server 2303 sends log data 2306 in a third format.

Using the log data received at the data intake and query system 108 from the three systems, the vendor can uniquely obtain an insight into user activity, user experience, and system behavior. The query system 214 allows the vendor's administrator to search the log data from the three systems, thereby obtaining correlated information, such as the order number and corresponding customer ID number of the person placing the order. The system also allows the administrator to see a visualization of related events via a user interface. The administrator can query the query system 214 for customer ID field value matches across the log data from the three systems that are stored in common storage 216. The customer ID field value exists in the data gathered from the three systems, but the customer ID field value may be located in different areas of the data given differences in the architecture of the systems. There is a semantic relationship between the customer ID field values generated by the three systems. The query system 214 requests events from the one or more data stores 218 to gather relevant events from the three systems. The search head 504 then applies extraction rules to the events in order to extract field values that it can correlate. The search head 504 may apply a different extraction rule to each set of events from each system when the event format differs among systems. In this example, the user interface can display to the administrator the events corresponding to the common customer ID field values 2307, 2308, and 2309, thereby providing the administrator with insight into a customer's experience.

Note that query results can be returned to a client, a search head 504, or any other system component for further processing. In general, query results may include a set of one or more events, a set of one or more values obtained from the events, a subset of the values, statistics calculated based on the values, a report containing the values, a visualization (e.g., a graph or chart) generated from the values, and the like.

The query system 214 enables users to run queries against the stored data to retrieve events that meet criteria specified in a query, such as containing certain keywords or having specific values in defined fields. FIG. 23B illustrates the manner in which keyword searches and field searches are processed in accordance with disclosed embodiments.

If a user inputs a search query into search bar 2310 that includes only keywords (also known as “tokens”), e.g., the keyword “error” or “warning”, the query system 214 of the data intake and query system 108 can search for those keywords directly in the event data 2311 stored in the raw record data store. Note that while FIG. 23B only illustrates four events 2312, 2313, 2314, 2315, the raw record data store (corresponding to data store 212 in FIG. 2) may contain records for millions of events.

As disclosed above, the indexing system 212 can optionally generate a keyword index to facilitate fast keyword searching for event data. The indexing system 212 can include the identified keywords in an index, which associates each stored keyword with reference pointers to events containing that keyword (or to locations within events where that keyword is located, other location identifiers, etc.). When the query system 214 subsequently receives a keyword-based query, the query system 214 can access the keyword index to quickly identify events containing the keyword. For example, if the keyword “HTTP” was indexed by the indexing system 212 at index time, and the user searches for the keyword “HTTP”, the events 2312, 2313, and 2314, will be identified based on the results returned from the keyword index. As noted above, the index contains reference pointers to the events containing the keyword, which allows for efficient retrieval of the relevant events from the raw record data store.

If a user searches for a keyword that has not been indexed by the indexing system 212, the data intake and query system 108 may nevertheless be able to retrieve the events by searching the event data for the keyword in the raw record data store directly as shown in FIG. 23B. For example, if a user searches for the keyword “frank”, and the name “frank” has not been indexed at search time, the query system 214 can search the event data directly and return the first event 2312. Note that whether the keyword has been indexed at index time or search time or not, in both cases the raw data with the events 2311 is accessed from the raw data record store to service the keyword search. In the case where the keyword has been indexed, the index will contain a reference pointer that will allow for a more efficient retrieval of the event data from the data store. If the keyword has not been indexed, the query system 214 can search through the records in the data store to service the search.
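
The two keyword search paths described above (lookup through a keyword index, and a direct scan of the raw records when the keyword was never indexed) can be sketched as follows. This is a simplified illustration under stated assumptions: the event strings are made up and are not taken from FIG. 23B, `INDEXED_KEYWORDS` is a hypothetical choice of which tokens the indexing step retains, and list positions stand in for the reference pointers.

```python
import re
from collections import defaultdict

# Made-up raw records; list positions play the role of reference pointers.
raw_store = [
    "127.0.0.1 - frank GET /apache.gif HTTP/1.0 200",
    "127.0.0.1 - bob GET /mickey_mouse.gif HTTP/1.0 200",
    "127.0.0.3 - virgil GET /donald_duck.gif HTTP/1.0 200",
    "[Sunday Oct 10 1:59:33 2010] [error] File does not exist",
]

# Hypothetical policy: only these tokens are written to the keyword index.
INDEXED_KEYWORDS = {"http", "error"}


def build_keyword_index(store):
    """Map each indexed keyword to the positions of events containing it."""
    index = defaultdict(set)
    for pos, text in enumerate(store):
        for token in re.findall(r"\w+", text):
            if token.lower() in INDEXED_KEYWORDS:
                index[token.lower()].add(pos)
    return index


def keyword_search(keyword, store, index):
    """Use the index when possible; otherwise scan the raw records."""
    key = keyword.lower()
    if key in index:
        positions = sorted(index[key])
    else:
        positions = [i for i, text in enumerate(store) if key in text.lower()]
    return [store[i] for i in positions]


index = build_keyword_index(raw_store)
print(keyword_search("HTTP", raw_store, index))
print(keyword_search("frank", raw_store, index))
```

In both paths the matching events are ultimately read from the raw store; the index only shortens the work of deciding which records to read.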

In most cases, however, in addition to keywords, a user's search will also include fields. The term “field” refers to a location in the event data containing one or more values for a specific data item. Often, a field is a value with a fixed, delimited position on a line, or a name and value pair, where there is a single value to each field name. A field can also be multivalued, that is, it can appear more than once in an event and have a different value for each appearance, e.g., email address fields. Fields are searchable by the field name or field name-value pairs. Some examples of fields are “clientip” for IP addresses accessing a web server, or the “From” and “To” fields in email addresses.

By way of further example, consider the search, “status=404”. This search query finds events with “status” fields that have a value of “404.” When the search is run, the query system 214 does not look for events with any other “status” value. It also does not look for events containing other fields that share “404” as a value. As a result, the search returns a set of results that are more focused than if “404” had been used in the search string as part of a keyword search. Note also that fields can appear in events as “key=value” pairs such as “user_name=Bob.” But in most cases, field values appear in fixed, delimited positions without identifying keys. For example, the data store may contain events where the “user_name” value always appears by itself after the timestamp as illustrated by the following string: “Nov 15 09:33:22 johnmedlock.”
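
The difference between the field search “status=404” and a keyword search for “404” can be seen in a small example. The events and helper names below are hypothetical, and the fields are assumed to have already been extracted into each event dictionary.

```python
# Made-up events with already-extracted fields alongside the raw text.
events = [
    {"_raw": "GET /a status=404 bytes=512", "status": "404", "bytes": "512"},
    {"_raw": "GET /b status=200 bytes=404", "status": "200", "bytes": "404"},
    {"_raw": "GET /c status=500 bytes=100", "status": "500", "bytes": "100"},
]


def keyword_search(rows, token):
    """Match the token anywhere in the raw event text."""
    return [r for r in rows if token in r["_raw"]]


def field_search(rows, field, value):
    """Match only events whose named field has exactly the given value."""
    return [r for r in rows if r.get(field) == value]


by_keyword = keyword_search(events, "404")
by_field = field_search(events, "status", "404")
print(len(by_keyword), len(by_field))
```

The keyword search also returns the second event, whose “bytes” field happens to be 404, while the field search returns only the event whose “status” is 404.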

The data intake and query system 108 advantageously allows for search time field extraction. In other words, fields can be extracted from the event data at search time using late-binding schema as opposed to at data ingestion time, which was a major limitation of the prior art systems.

In response to receiving the search query, a search head 504 of the query system 214 can use extraction rules to extract values for the fields associated with a field or fields in the event data being searched. The search head 504 can obtain extraction rules that specify how to extract a value for certain fields from an event. Extraction rules can comprise regex rules that specify how to extract values for the relevant fields. In addition to specifying how to extract field values, the extraction rules may also include instructions for deriving a field value by performing a function on a character string or value retrieved by the extraction rule. For example, a transformation rule may truncate a character string, or convert the character string into a different data format. In some cases, the query itself can specify one or more extraction rules.

FIG. 23B illustrates the manner in which configuration files may be used to configure custom fields at search time in accordance with the disclosed embodiments. In response to receiving a search query, the data intake and query system 108 determines if the query references a “field.” For example, a query may request a list of events where the “clientip” field equals “127.0.0.1.” If the query itself does not specify an extraction rule and if the field is not a metadata field, e.g., time, host, source, source type, etc., then in order to determine an extraction rule, the query system 214 may, in one or more embodiments, need to locate configuration file 2316 during the execution of the search as shown in FIG. 23B.

Configuration file 2316 may contain extraction rules for all the various fields that are not metadata fields, e.g., the “clientip” field. The extraction rules may be inserted into the configuration file in a variety of ways. In some embodiments, the extraction rules can comprise regular expression rules that are manually entered in by the user. Regular expressions match patterns of characters in text and are used for extracting custom fields in text.

In one or more embodiments, as noted above, a field extractor may be configured to automatically generate extraction rules for certain field values in the events when the events are being created, indexed, or stored, or possibly at a later time. In one embodiment, a user may be able to dynamically create custom fields by highlighting portions of a sample event that should be extracted as fields using a graphical user interface. The system can then generate a regular expression that extracts those fields from similar events and store the regular expression as an extraction rule for the associated field in the configuration file 2316.

In some embodiments, the indexing system 212 can automatically discover certain custom fields at index time and the regular expressions for those fields will be automatically generated at index time and stored as part of extraction rules in configuration file 2316. For example, fields that appear in the event data as “key=value” pairs may be automatically extracted as part of an automatic field discovery process. Note that there may be several other ways of adding field definitions to configuration files in addition to the methods discussed herein.

The search head 504 can apply the extraction rules derived from configuration file 2316 to event data that it receives from search nodes 506. The search nodes 506 may apply the extraction rules from the configuration file to events in an associated data store or common storage 216. Extraction rules can be applied to all the events in a data store, or to a subset of the events that have been filtered based on some criteria (e.g., event time stamp values, etc.). Extraction rules can be used to extract one or more values for a field from events by parsing the event data and examining the event data for one or more patterns of characters, numbers, delimiters, etc., that indicate where the field begins and, optionally, ends.

In one or more embodiments, the extraction rule in configuration file 2316 will also need to define the type or set of events that the rule applies to. Because the raw record data store will contain events from multiple heterogeneous sources, multiple events may contain the same fields in different locations because of discrepancies in the format of the data generated by the various sources. Furthermore, certain events may not contain a particular field at all. For example, event 2315 also contains a “clientip” field; however, the “clientip” field is in a different format from events 2312, 2313, and 2314. To address the discrepancies in the format and content of the different types of events, the configuration file will also need to specify the set of events that an extraction rule applies to, e.g., extraction rule 2317 specifies a rule for filtering by the type of event and contains a regular expression for parsing out the field value. Accordingly, each extraction rule can pertain to only a particular type of event. If a particular field, e.g., “clientip” occurs in multiple types of events, each of those types of events can have its own corresponding extraction rule in the configuration file 2316 and each of the extraction rules would comprise a different regular expression to parse out the associated field value. The most common way to categorize events is by source type because events generated by a particular source can have the same format.

The field extraction rules stored in configuration file 2316 perform search-time field extractions. For example, for a query that requests a list of events with source type “access_combined” where the “clientip” field equals “127.0.0.1,” the query system 214 can first locate the configuration file 2316 to retrieve extraction rule 2317 that allows it to extract values associated with the “clientip” field from the event data 2320 where the source type is “access_combined.” After the “clientip” field has been extracted from all the events comprising the “clientip” field where the source type is “access_combined,” the query system 214 can then execute the field criteria by performing the compare operation to filter out the events where the “clientip” field equals “127.0.0.1.” In the example shown in FIG. 23B, the events 2312, 2313, and 2314 would be returned in response to the user query. In this manner, the query system 214 can service queries containing field criteria in addition to queries containing keyword criteria (as explained above).
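
The lookup-extract-compare sequence described above can be sketched with a small in-memory stand-in for the configuration file. Everything named below is an assumption for illustration: the `CONFIG` structure, the “apache_error” source type name, the regular expressions, and the sample events are invented and do not reflect the actual contents of configuration file 2316 or extraction rule 2317.

```python
import re

# Hypothetical configuration: one extraction rule per (source type, field),
# since the same field can sit in different places in different event types.
CONFIG = {
    "access_combined": {"clientip": re.compile(r"^(\d+\.\d+\.\d+\.\d+)\s")},
    "apache_error": {"clientip": re.compile(r"\[client (\d+\.\d+\.\d+\.\d+)\]")},
}

events = [
    {"sourcetype": "access_combined", "_raw": "127.0.0.1 - frank GET /apache.gif 200"},
    {"sourcetype": "access_combined", "_raw": "10.0.1.2 - bob GET /index.html 200"},
    {"sourcetype": "apache_error", "_raw": "[error] [client 127.0.0.1] File does not exist"},
]


def field_query(rows, sourcetype, field, wanted):
    """Locate the rule, extract the field at search time, then compare."""
    rule = CONFIG.get(sourcetype, {}).get(field)
    if rule is None:
        return []
    hits = []
    for row in rows:
        if row["sourcetype"] != sourcetype:
            continue  # the rule applies only to this type of event
        match = rule.search(row["_raw"])
        if match and match.group(1) == wanted:
            hits.append(row)
    return hits


matches = field_query(events, "access_combined", "clientip", "127.0.0.1")
count = len(matches)  # a later pipeline stage could aggregate, e.g., count
print(count)
```

Adding a new field in this sketch only requires adding an entry to the rule table; the stored events are left untouched, which is the essence of a late-binding schema.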

In some embodiments, the configuration file 2316 can be created during indexing. It may either be manually created by the user or automatically generated with certain predetermined field extraction rules. As discussed above, the events may be distributed across several data stores in common storage 216, wherein various indexing nodes 404 may be responsible for storing the events in the common storage 216 and various search nodes 506 may be responsible for searching the events contained in common storage 216.

The ability to add schema to the configuration file at search time results in increased efficiency. A user can create new fields at search time and simply add field definitions to the configuration file. As a user learns more about the data in the events, the user can continue to refine the late-binding schema by adding new fields, deleting fields, or modifying the field extraction rules in the configuration file for use the next time the schema is used by the system. Because the data intake and query system 108 maintains the underlying raw data and uses late-binding schema for searching the raw data, it enables a user to continue investigating and learn valuable insights about the raw data long after data ingestion time.

The ability to add multiple field definitions to the configuration file at search time also results in increased flexibility. For example, multiple field definitions can be added to the configuration file to capture the same field across events generated by different source types. This allows the data intake and query system 108 to search and correlate data across heterogeneous sources flexibly and efficiently.

Further, by providing the field definitions for the queried fields at search time, the configuration file 2316 allows the record data store to be field searchable. In other words, the raw record data store can be searched using keywords as well as fields, wherein the fields are searchable name/value pairings that distinguish one event from another and can be defined in configuration file 2316 using extraction rules. In comparison to a search containing field names, a keyword search does not need the configuration file and can search the event data directly as shown in FIG. 23B.

It should also be noted that any events filtered out by performing a search-time field extraction using a configuration file 2316 can be further processed by directing the results of the filtering step to a processing step using a pipelined search language. Using the prior example, a user can pipeline the results of the compare step to an aggregate function by asking the query system 214 to count the number of events where the “clientip” field equals “127.0.0.1.”

4.8. Example Search Screen

FIG. 24A is an interface diagram of an example user interface for a search screen 2400, in accordance with example embodiments. Search screen 2400 includes a search bar 2402 that accepts user input in the form of a search string. It also includes a time range picker 2412 that enables the user to specify a time range for the search. For historical searches (e.g., searches based on a particular historical time range), the user can select a specific time range, or alternatively a relative time range, such as “today,” “yesterday” or “last week.” For real-time searches (e.g., searches whose results are based on data received in real-time), the user can select the size of a preceding time window to search for real-time events. Search screen 2400 also initially displays a “data summary” dialog as is illustrated in FIG. 24B that enables the user to select different sources for the events, such as by selecting specific hosts and log files.

After the search is executed, the search screen 2400 in FIG. 24A can display the results through search results tabs 2404, wherein search results tabs 2404 includes: an “events tab” that displays various information about events returned by the search; a “statistics tab” that displays statistics about the search results; and a “visualization tab” that displays various visualizations of the search results. The events tab illustrated in FIG. 24A displays a timeline graph 2405 that graphically illustrates the number of events that occurred in one-hour intervals over the selected time range. The events tab also displays an events list 2408 that enables a user to view the machine data in each of the returned events.
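
The data behind a timeline graph such as timeline graph 2405 can be computed by grouping event timestamps into one-hour intervals. The sketch below shows one way to do so; the timestamps and the `hourly_counts` helper are made up for illustration, and how the counts are rendered is outside the scope of the example.

```python
from collections import Counter
from datetime import datetime

# Made-up event timestamps within a selected time range.
timestamps = [
    datetime(2019, 10, 18, 9, 5),
    datetime(2019, 10, 18, 9, 40),
    datetime(2019, 10, 18, 10, 15),
    datetime(2019, 10, 18, 12, 0),
]


def hourly_counts(times):
    """Count events per one-hour interval, keyed by the start of the hour."""
    buckets = Counter(t.replace(minute=0, second=0, microsecond=0) for t in times)
    return sorted(buckets.items())


timeline = hourly_counts(timestamps)
for hour_start, n in timeline:
    print(hour_start.isoformat(), n)
```

Hours with no events (here, the 11:00 interval) are simply absent from the result; a display layer would need to fill them with zero if a continuous axis is desired.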

The events tab additionally displays a sidebar that is an interactive field picker 2406. The field picker 2406 may be displayed to a user in response to the search being executed and allows the user to further analyze the search results based on the fields in the events of the search results. The field picker 2406 includes field names that reference fields present in the events in the search results. The field picker may display any Selected Fields 2420 that a user has pre-selected for display (e.g., host, source, sourcetype) and may also display any Interesting Fields 2422 that the system determines may be interesting to the user based on pre-specified criteria (e.g., action, bytes, categoryid, clientip, date_hour, date_mday, date_minute, etc.). The field picker also provides an option to display field names for all the fields present in the events of the search results using the All Fields control 2424.

Each field name in the field picker 2406 has a value type identifier to the left of the field name, such as value type identifier 2426. A value type identifier identifies the type of value for the respective field, such as an “a” for fields that include literal values or a “#” for fields that include numerical values.

Each field name in the field picker also has a unique value count to the right of the field name, such as unique value count 2428. The unique value count indicates the number of unique values for the respective field in the events of the search results.

Each field name is selectable to view the events in the search results that have the field referenced by that field name. For example, a user can select the “host” field name, and the events shown in the events list 2408 will be updated with events in the search results that have the field that is referenced by the field name “host.”

4.9. Data Models

A data model is a hierarchically structured search-time mapping of semantic knowledge about one or more datasets. It encodes the domain knowledge used to build a variety of specialized searches of those datasets. Those searches, in turn, can be used to generate reports.

A data model is composed of one or more “objects” (or “data model objects”) that define or otherwise correspond to a specific set of data. An object is defined by constraints and attributes. An object's constraints are search criteria that define the set of events to be operated on by running a search having that search criteria at the time the data model is selected. An object's attributes are the set of fields to be exposed for operating on that set of events generated by the search criteria.

Objects in data models can be arranged hierarchically in parent/child relationships. Each child object represents a subset of the dataset covered by its parent object. The top-level objects in data models are collectively referred to as “root objects.”

Child objects have inheritance. Child objects inherit constraints and attributes from their parent objects and may have additional constraints and attributes of their own. Child objects provide a way of filtering events from parent objects. Because a child object may provide an additional constraint in addition to the constraints it has inherited from its parent object, the dataset it represents may be a subset of the dataset that its parent represents. For example, a first data model object may define a broad set of data pertaining to e-mail activity generally, and another data model object may define specific datasets within the broad dataset, such as a subset of the e-mail data pertaining specifically to e-mails sent. For example, a user can simply select an “e-mail activity” data model object to access a dataset relating to e-mails generally (e.g., sent or received), or select an “e-mails sent” data model object (or data sub-model object) to access a dataset relating to e-mails sent.
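By way of non-limiting illustration only, the parent/child inheritance described above can be sketched in Python. The class name DataModelObject, the representation of a constraint as a predicate over an event dictionary, and the sample events are all hypothetical and are assumed here purely for illustration; they are not taken from the described system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]
Constraint = Callable[[Event], bool]


@dataclass
class DataModelObject:
    """Hypothetical data model object: constraints select events, attributes expose fields."""
    name: str
    constraints: List[Constraint] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    parent: Optional["DataModelObject"] = None

    def all_constraints(self) -> List[Constraint]:
        # Inherited constraints come first, followed by the object's own.
        inherited = self.parent.all_constraints() if self.parent else []
        return inherited + self.constraints

    def all_attributes(self) -> List[str]:
        # Inherited attributes plus any attributes the child adds.
        inherited = self.parent.all_attributes() if self.parent else []
        return inherited + [a for a in self.attributes if a not in inherited]

    def search(self, events: List[Event]) -> List[Event]:
        # An event belongs to the object's dataset only if every constraint holds.
        checks = self.all_constraints()
        return [e for e in events if all(check(e) for check in checks)]


# Assumed sample objects mirroring the e-mail example in the text.
email_activity = DataModelObject(
    name="e-mail activity",
    constraints=[lambda e: e.get("sourcetype") == "email"],
    attributes=["sender", "recipient"],
)
emails_sent = DataModelObject(
    name="e-mails sent",
    constraints=[lambda e: e.get("action") == "sent"],
    attributes=["delivery_status"],
    parent=email_activity,
)

# Assumed sample events.
events = [
    {"sourcetype": "email", "action": "sent", "sender": "a@example.com"},
    {"sourcetype": "email", "action": "received", "sender": "b@example.com"},
    {"sourcetype": "web", "action": "sent"},
]
```

In this sketch, searching with the child object applies both the inherited constraint and the child's own constraint, so the child's dataset is a subset of the parent's dataset.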

Because a data model object is defined by its constraints (e.g., a set of search criteria) and attributes (e.g., a set of fields), a data model object can be used to quickly search data to identify a set of events and to identify a set of fields to be associated with the set of events. For example, an “e-mails sent” data model object may specify a search for events relating to e-mails that have been sent, and specify a set of fields that are associated with the events. Thus, a user can retrieve and use the “e-mails sent” data model object to quickly search source data for events relating to sent e-mails, and may be provided with a listing of the set of fields relevant to the events in a user interface screen.

Examples of data models can include electronic mail, authentication,databases, intrusion detection, malware, application state, alerts,compute inventory, network sessions, network traffic, performance,audits, updates, vulnerabilities, etc. Data models and their objects canbe designed by knowledge managers in an organization, and they canenable downstream users to quickly focus on a specific set of data. Auser iteratively applies a model development tool (not shown in FIG.24A) to prepare a query that defines a subset of events and assigns anobject name to that subset. A child subset is created by furtherlimiting a query that generated a parent subset.

Data definitions in associated schemas can be taken from the commoninformation model (CIM) or can be devised for a particular schema andoptionally added to the CIM. Child objects inherit fields from parentsand can include fields not present in parents. A model developer canselect fewer extraction rules than are available for the sourcesreturned by the query that defines events belonging to a model.Selecting a limited set of extraction rules can be a tool forsimplifying and focusing the data model, while allowing a userflexibility to explore the data subset. Development of a data model isfurther explained in U.S. Pat. Nos. 8,788,525 and 8,788,526, bothentitled “DATA MODEL FOR MACHINE DATA FOR SEMANTIC SEARCH”, both issuedon 22 Jul. 2014, U.S. Pat. No. 8,983,994, entitled “GENERATION OF A DATAMODEL FOR SEARCHING MACHINE DATA”, issued on 17 Mar. 2015, U.S. Pat. No.9,128,980, entitled “GENERATION OF A DATA MODEL APPLIED TO QUERIES”,issued on 8 Sep. 2015, and U.S. Pat. No. 9,589,012, entitled “GENERATIONOF A DATA MODEL APPLIED TO OBJECT QUERIES”, issued on 7 Mar. 2017, eachof which is hereby incorporated by reference in its entirety for allpurposes.

A data model can also include reports. One or more report formats can beassociated with a particular data model and be made available to runagainst the data model. A user can use child objects to design reportswith object datasets that already have extraneous data pre-filtered out.In some embodiments, the data intake and query system 108 provides theuser with the ability to produce reports (e.g., a table, chart,visualization, etc.) without having to enter SPL, SQL, or other querylanguage terms into a search screen. Data models are used as the basisfor the search feature.

Data models may be selected in a report generation interface. The reportgenerator supports drag-and-drop organization of fields to be summarizedin a report. When a model is selected, the fields with availableextraction rules are made available for use in the report. The user mayrefine and/or filter search results to produce more precise reports. Theuser may select some fields for organizing the report and select otherfields for providing detail according to the report organization. Forexample, “region” and “salesperson” are fields used for organizing thereport and sales data can be summarized (subtotaled and totaled) withinthis organization. The report generator allows the user to specify oneor more fields within events and apply statistical analysis on valuesextracted from the specified one or more fields. The report generatormay aggregate search results across sets of events and generatestatistics based on aggregated search results. Building reports usingthe report generation interface is further explained in U.S. patentapplication Ser. No. 14/503,335, entitled “GENERATING REPORTS FROMUNSTRUCTURED DATA”, filed on 30 Sep. 2014, and which is herebyincorporated by reference in its entirety for all purposes. Datavisualizations also can be generated in a variety of formats, byreference to the data model. Reports, data visualizations, and datamodel objects can be saved and associated with the data model for futureuse. The data model object may be used to perform searches of otherdata.

FIGS. 25-31 are interface diagrams of example report generation userinterfaces, in accordance with example embodiments. The reportgeneration process may be driven by a predefined data model object, suchas a data model object defined and/or saved via a reporting applicationor a data model object obtained from another source. A user can load asaved data model object using a report editor. For example, the initialsearch query and fields used to drive the report editor may be obtainedfrom a data model object. The data model object that is used to drive areport generation process may define a search and a set of fields. Uponloading of the data model object, the report generation process mayenable a user to use the fields (e.g., the fields defined by the datamodel object) to define criteria for a report (e.g., filters, splitrows/columns, aggregates, etc.) and the search may be used to identifyevents (e.g., to identify events responsive to the search) used togenerate the report. That is, for example, if a data model object isselected to drive a report editor, the graphical user interface of thereport editor may enable a user to define reporting criteria for thereport using the fields associated with the selected data model object,and the events used to generate the report may be constrained to theevents that match, or otherwise satisfy, the search constraints of theselected data model object.

The selection of a data model object for use in driving a report generation may be facilitated by a data model object selection interface. FIG. 25 illustrates an example interactive data model selection graphical user interface 2500 of a report editor that displays a listing of available data models 2501. The user may select one of the data models 2502.

FIG. 26 illustrates an example data model object selection graphical user interface 2600 that displays available data model objects 2601 for the selected data model 2502. The user may select one of the displayed data model objects 2602 for use in driving the report generation process.

Once a data model object is selected by the user, a user interface screen 2700 shown in FIG. 27A may display an interactive listing of automatic field identification options 2701 based on the selected data model object. For example, a user may select one of the three illustrated options (e.g., the “All Fields” option 2702, the “Selected Fields” option 2703, or the “Coverage” option (e.g., fields with at least a specified % of coverage) 2704). If the user selects the “All Fields” option 2702, all of the fields identified from the events that were returned in response to an initial search query may be selected. That is, for example, all of the fields of the identified data model object fields may be selected. If the user selects the “Selected Fields” option 2703, only the fields from the fields of the identified data model object fields that are selected by the user may be used. If the user selects the “Coverage” option 2704, only the fields of the identified data model object fields meeting a specified coverage criterion may be selected. A percent coverage may refer to the percentage of events returned by the initial search query that a given field appears in. Thus, for example, if an object dataset includes 10,000 events returned in response to an initial search query, and the “avg_age” field appears in 854 of those 10,000 events, then the “avg_age” field would have a coverage of 8.54% for that object dataset. If, for example, the user selects the “Coverage” option and specifies a coverage value of 2%, only fields having a coverage value equal to or greater than 2% may be selected. The number of fields corresponding to each selectable option may be displayed in association with each option. For example, “97” displayed next to the “All Fields” option 2702 indicates that 97 fields will be selected if the “All Fields” option is selected. The “3” displayed next to the “Selected Fields” option 2703 indicates that 3 of the 97 fields will be selected if the “Selected Fields” option is selected. The “49” displayed next to the “Coverage” option 2704 indicates that 49 of the 97 fields (e.g., the 49 fields having a coverage of 2% or greater) will be selected if the “Coverage” option is selected. The number of fields corresponding to the “Coverage” option may be dynamically updated based on the specified percent of coverage.
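By way of non-limiting illustration only, the percent coverage computation described above can be sketched in Python. The function names field_coverage and fields_meeting_coverage and the synthetic events are hypothetical; the figures of 10,000 events and 854 occurrences of “avg_age” are taken from the example in the text, while the “rare_field” field is an assumption added to show a field that falls below a 2% threshold.

```python
from typing import Dict, Iterable, List


def field_coverage(events: Iterable[Dict[str, object]]) -> Dict[str, float]:
    """Return, for each field name, the percentage of events in which it appears."""
    counts: Dict[str, int] = {}
    total = 0
    for event in events:
        total += 1
        for name in event:
            counts[name] = counts.get(name, 0) + 1
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


def fields_meeting_coverage(coverage: Dict[str, float], threshold_pct: float) -> List[str]:
    """Select fields whose coverage is equal to or greater than the threshold."""
    return sorted(name for name, pct in coverage.items() if pct >= threshold_pct)


# Synthetic object dataset: 10,000 events, "avg_age" present in 854 of them,
# and an assumed "rare_field" present in 100 of them (1% coverage).
events = []
for i in range(10_000):
    event = {"host": "h1"}
    if i < 854:
        event["avg_age"] = 30
    if i < 100:
        event["rare_field"] = "x"
    events.append(event)

coverage = field_coverage(events)
selected = fields_meeting_coverage(coverage, 2.0)
```

Under these assumptions, “avg_age” has a coverage of 8.54% and is selected at a 2% threshold, while “rare_field” is not.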

FIG. 27B illustrates an example graphical user interface screen 2705displaying the reporting application's “Report Editor” page. The screenmay display interactive elements for defining various elements of areport. For example, the page includes a “Filters” element 2706, a“Split Rows” element 2707, a “Split Columns” element 2708, and a “ColumnValues” element 2709. The page may include a list of search results2711. In this example, the Split Rows element 2707 is expanded,revealing a listing of fields 2710 that can be used to define additionalcriteria (e.g., reporting criteria). The listing of fields 2710 maycorrespond to the selected fields. That is, the listing of fields 2710may list only the fields previously selected, either automaticallyand/or manually by a user. FIG. 27C illustrates a formatting dialogue2712 that may be displayed upon selecting a field from the listing offields 2710. The dialogue can be used to format the display of theresults of the selection (e.g., label the column for the selected fieldto be displayed as “component”).

FIG. 27D illustrates an example graphical user interface screen 2705including a table of results 2713 based on the selected criteriaincluding splitting the rows by the “component” field. A column 2714having an associated count for each component listed in the table may bedisplayed that indicates an aggregate count of the number of times thatthe particular field-value pair (e.g., the value in a row for aparticular field, such as the value “BucketMover” for the field“component”) occurs in the set of events responsive to the initialsearch query.

FIG. 28 illustrates an example graphical user interface screen 2800 thatallows the user to filter search results and to perform statisticalanalysis on values extracted from specific fields in the set of events.In this example, the top ten product names ranked by price are selectedas a filter 2801 that causes the display of the ten most popularproducts sorted by price. Each row is displayed by product name andprice 2802. This results in each product displayed in a column labeled“product name” along with an associated price in a column labeled“price” 2806. Statistical analysis of other fields in the eventsassociated with the ten most popular products have been specified ascolumn values 2803. A count of the number of successful purchases foreach product is displayed in column 2804. These statistics may beproduced by filtering the search results by the product name, findingall occurrences of a successful purchase in a field within the eventsand generating a total of the number of occurrences. A sum of the totalsales is displayed in column 2805, which is a result of themultiplication of the price and the number of successful purchases foreach product.

The reporting application allows the user to create graphicalvisualizations of the statistics generated for a report. For example,FIG. 29 illustrates an example graphical user interface 2900 thatdisplays a set of components and associated statistics 2901. Thereporting application allows the user to select a visualization of thestatistics in a graph (e.g., bar chart, scatter plot, area chart, linechart, pie chart, radial gauge, marker gauge, filler gauge, etc.), wherethe format of the graph may be selected using the user interfacecontrols 2902 along the left panel of the user interface 2900. FIG. 30illustrates an example of a bar chart visualization 3000 of an aspect ofthe statistical data 2901. FIG. 31 illustrates a scatter plotvisualization 3100 of an aspect of the statistical data 2901.

4.10. Acceleration Techniques

The above-described system provides significant flexibility by enabling a user to analyze massive quantities of minimally-processed data “on the fly” at search time using a late-binding schema, instead of storing pre-specified portions of the data in a database at ingestion time. This flexibility enables a user to see valuable insights, correlate data, and perform subsequent queries to examine interesting aspects of the data that may not have been apparent at ingestion time.

However, performing extraction and analysis operations at search time can involve a large amount of data and require a large number of computational operations, which can cause delays in processing the queries. Advantageously, the data intake and query system 108 also employs a number of unique acceleration techniques that have been developed to speed up analysis operations performed at search time. These techniques include: (1) performing search operations in parallel using multiple search nodes 506; (2) using a keyword index; (3) using a high performance analytics store; and (4) accelerating the process of generating reports. These novel techniques are described in more detail below.

4.10.1. Aggregation Technique

To facilitate faster query processing, a query can be structured such that multiple search nodes 506 perform the query in parallel, while aggregation of search results from the multiple search nodes 506 is performed at the search head 504. For example, FIG. 32 is an example search query received from a client and executed by search nodes 506, in accordance with example embodiments. FIG. 32 illustrates how a search query 3202 received from a client at a search head 504 can be split into two phases, including: (1) subtasks 3204 (e.g., data retrieval or simple filtering) that may be performed in parallel by search nodes 506, and (2) a search results aggregation operation 3206 to be executed by the search head 504 when the results are ultimately collected from the search nodes 506.

During operation, upon receiving search query 3202, a search head 504 determines that a portion of the operations involved with the search query may be performed locally by the search head 504. The search head 504 modifies search query 3202 by substituting “stats” (create aggregate statistics over result sets received from the search nodes 506 at the search head 504) with “prestats” (create statistics by the search node 506 from a local result set) to produce search query 3204, and then distributes search query 3204 to distributed search nodes 506, which are also referred to as “search peers” or “peer search nodes.” Note that search queries may generally specify search criteria or operations to be performed on events that meet the search criteria. Search queries may also specify field names, as well as search criteria for the values in the fields or operations to be performed on the values in the fields. Moreover, the search head 504 may distribute the full search query to the search peers as illustrated in FIG. 6A, or may alternatively distribute a modified version (e.g., a more restricted version) of the search query to the search peers. In this example, the search nodes 506 are responsible for producing the results and sending them to the search head 504. After the search nodes 506 return the results to the search head 504, the search head 504 aggregates the received results 3206 to form a single search result set. By executing the query in this manner, the system effectively distributes the computational operations across the search nodes 506 while minimizing data transfers.
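By way of non-limiting illustration only, the two-phase execution described above can be sketched in Python as a partial aggregation on each search node followed by a merge at the search head. The function names node_prestats and head_stats are hypothetical, a simple count by field value is assumed as the statistic, and the format of the intermediate results actually exchanged between search nodes and a search head is not described in this section and is not represented here.

```python
from collections import Counter
from typing import Dict, Iterable

Event = Dict[str, str]


def node_prestats(local_events: Iterable[Event], field: str) -> Counter:
    """Phase 1 (each search node): partial counts computed from local events only."""
    return Counter(e[field] for e in local_events if field in e)


def head_stats(partials: Iterable[Counter]) -> Dict[str, int]:
    """Phase 2 (search head): merge the partial counts into a single result set."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Assumed sample events held by two search nodes.
node_a = [{"status": "404"}, {"status": "200"}, {"status": "200"}]
node_b = [{"status": "200"}, {"status": "500"}, {"host": "web-1"}]

partials = [node_prestats(node_a, "status"), node_prestats(node_b, "status")]
result = head_stats(partials)
```

Only the compact partial counts travel from the nodes to the head in this sketch, which reflects the stated goal of distributing computation while limiting data transfers.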

4.10.2. Keyword Index

As described above with reference to the flow charts in FIG. 5A and FIG. 6A, data intake and query system 108 can construct and maintain one or more keyword indexes to quickly identify events containing specific keywords. This technique can greatly speed up the processing of queries involving specific keywords. As mentioned above, to build a keyword index, an indexing node 404 first identifies a set of keywords. Then, the indexing node 404 includes the identified keywords in an index, which associates each stored keyword with references to events containing that keyword, or to locations within events where that keyword is located. When the query system 214 subsequently receives a keyword-based query, the query system 214 can access the keyword index to quickly identify events containing the keyword.
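By way of non-limiting illustration only, a keyword index that associates each keyword with references to the events containing it can be sketched in Python. The function names build_keyword_index and lookup, the tokenization rule, and the use of list positions as event references are assumptions made for illustration; the manner in which an indexing node 404 actually identifies keywords is not specified in this section.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_keyword_index(events: List[str]) -> Dict[str, Set[int]]:
    """Map each keyword to the set of event references (list positions) containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for reference, raw in enumerate(events):
        # Assumed tokenization: lowercase runs of letters, digits, '_' and '.'.
        for keyword in re.findall(r"[a-z0-9_.]+", raw.lower()):
            index[keyword].add(reference)
    return index


def lookup(index: Dict[str, Set[int]], keyword: str) -> List[int]:
    """Return references to events containing the keyword, without scanning the events."""
    return sorted(index.get(keyword.lower(), set()))


# Assumed sample events.
events = [
    "ERROR disk full on host web-1",
    "INFO login ok for user alice",
    "ERROR login failed for user bob",
]
index = build_keyword_index(events)
```

A keyword-based query such as lookup(index, "error") then resolves to event references directly from the index.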

4.10.3. High Performance Analytics Store

To speed up certain types of queries, some embodiments of data intakeand query system 108 create a high performance analytics store, which isreferred to as a “summarization table,” that contains entries forspecific field-value pairs. Each of these entries keeps track ofinstances of a specific value in a specific field in the events andincludes references to events containing the specific value in thespecific field. For example, an example entry in a summarization tablecan keep track of occurrences of the value “94107” in a “ZIP code” fieldof a set of events and the entry includes references to all of theevents that contain the value “94107” in the ZIP code field. Thisoptimization technique enables the system to quickly process queriesthat seek to determine how many events have a particular value for aparticular field. To this end, the system can examine the entry in thesummarization table to count instances of the specific value in thefield without having to go through the individual events or perform dataextractions at search time. Also, if the system needs to process allevents that have a specific field-value combination, the system can usethe references in the summarization table entry to directly access theevents to extract further information without having to search all ofthe events to find the specific field-value combination at search time.
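By way of non-limiting illustration only, a summarization table keyed by field-value pair can be sketched in Python. The function names, the dictionary layout, and the use of list positions as event references are hypothetical and are assumed here for illustration; only the “ZIP code” value “94107” is taken from the example in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Dict[str, str]
SummarizationTable = Dict[Tuple[str, str], List[int]]


def build_summarization_table(events: List[Event], fields: List[str]) -> SummarizationTable:
    """Create one entry per (field, value) pair, holding references to matching events."""
    table: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for reference, event in enumerate(events):
        for name in fields:
            if name in event:
                table[(name, event[name])].append(reference)
    return dict(table)


def count_events(table: SummarizationTable, name: str, value: str) -> int:
    """Answer a 'how many events' query from the table entry alone."""
    return len(table.get((name, value), []))


def fetch_events(events: List[Event], table: SummarizationTable, name: str, value: str) -> List[Event]:
    """Use the stored references to access matching events directly."""
    return [events[reference] for reference in table.get((name, value), [])]


# Assumed sample events.
events = [
    {"zip_code": "94107", "action": "purchase"},
    {"zip_code": "10001", "action": "view"},
    {"zip_code": "94107", "action": "view"},
]
table = build_summarization_table(events, ["zip_code"])
```

In this sketch the count is obtained from the length of the entry's reference list, and the same references allow the matching events to be retrieved without searching all of the events.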

In some embodiments, the system maintains a separate summarization tablefor each of the above-described time-specific buckets that stores eventsfor a specific time range. A bucket-specific summarization tableincludes entries for specific field-value combinations that occur inevents in the specific bucket. Alternatively, the system can maintain asummarization table for the common storage 216, one or more data stores218 of the common storage 216, buckets cached on a search node 506, etc.The different summarization tables can include entries for the events inthe common storage 216, certain data stores 218 in the common storage216, or data stores associated with a particular search node 506, etc.

The summarization table can be populated by running a periodic query that scans a set of events to find instances of a specific field-value combination, or alternatively instances of all field-value combinations for a specific field. A periodic query can be initiated by a user, or can be scheduled to occur automatically at specific time intervals. A periodic query can also be automatically launched in response to a query that asks for a specific field-value combination.

In some cases, when the summarization tables may not cover all of theevents that are relevant to a query, the system can use thesummarization tables to obtain partial results for the events that arecovered by summarization tables, but may also have to search throughother events that are not covered by the summarization tables to produceadditional results. These additional results can then be combined withthe partial results to produce a final set of results for the query. Thesummarization table and associated techniques are described in moredetail in U.S. Pat. No. 8,682,925, entitled “DISTRIBUTED HIGHPERFORMANCE ANALYTICS STORE”, issued on 25 Mar. 2014, U.S. Pat. No.9,128,985, entitled “SUPPLEMENTING A HIGH PERFORMANCE ANALYTICS STOREWITH EVALUATION OF INDIVIDUAL EVENTS TO RESPOND TO AN EVENT QUERY”,issued on 8 Sep. 2015, and U.S. patent application Ser. No. 14/815,973,entitled “GENERATING AND STORING SUMMARIZATION TABLES FOR SETS OFSEARCHABLE EVENTS”, filed on 1 Aug. 2015, each of which is herebyincorporated by reference in its entirety for all purposes.
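By way of non-limiting illustration only, combining partial results from summarization tables with a search of events that are not covered by any summarization table can be sketched in Python. The bucket layout (a dictionary with an “events” list and an optional “summary” mapping of field-value pairs to counts) and the function name count_with_partial_summaries are hypothetical and are assumed here for illustration.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, str]
Summary = Dict[Tuple[str, str], int]


def count_with_partial_summaries(buckets: List[dict], field: str, value: str) -> int:
    """Count matching events, using a summary where one exists and scanning otherwise."""
    total = 0
    for bucket in buckets:
        summary: Optional[Summary] = bucket.get("summary")
        if summary is not None:
            # Covered bucket: partial result comes straight from the summarization table.
            total += summary.get((field, value), 0)
        else:
            # Uncovered bucket: fall back to searching the individual events.
            total += sum(1 for event in bucket["events"] if event.get(field) == value)
    return total


# Assumed sample buckets: the first is summarized, the second is not.
buckets = [
    {
        "events": [{"zip_code": "94107"}, {"zip_code": "10001"}],
        "summary": {("zip_code", "94107"): 1, ("zip_code", "10001"): 1},
    },
    {
        "events": [{"zip_code": "94107"}, {"zip_code": "94107"}],
        "summary": None,
    },
]
total_94107 = count_with_partial_summaries(buckets, "zip_code", "94107")
```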

To speed up certain types of queries, e.g., frequently encounteredqueries or computationally intensive queries, some embodiments of dataintake and query system 108 create a high performance analytics store,which is referred to as a “summarization table,” (also referred to as a“lexicon” or “inverted index”) that contains entries for specificfield-value pairs. Each of these entries keeps track of instances of aspecific value in a specific field in the event data and includesreferences to events containing the specific value in the specificfield. For example, an example entry in an inverted index can keep trackof occurrences of the value “94107” in a “ZIP code” field of a set ofevents and the entry includes references to all of the events thatcontain the value “94107” in the ZIP code field. Creating the invertedindex data structure avoids needing to incur the computational overheadeach time a statistical query needs to be run on a frequentlyencountered field-value pair. In order to expedite queries, in certainembodiments, the query system 214 can employ the inverted index separatefrom the raw record data store to generate responses to the receivedqueries.

Note that the term “summarization table” or “inverted index” as usedherein is a data structure that may be generated by the indexing system212 that includes at least field names and field values that have beenextracted and/or indexed from event records. An inverted index may alsoinclude reference values that point to the location(s) in the fieldsearchable data store where the event records that include the field maybe found. Also, an inverted index may be stored using variouscompression techniques to reduce its storage size.

Further, note that the term “reference value” (also referred to as a“posting value”) as used herein is a value that references the locationof a source record in the field searchable data store. In someembodiments, the reference value may include additional informationabout each record, such as timestamps, record size, meta-data, or thelike. Each reference value may be a unique identifier which may be usedto access the event data directly in the field searchable data store. Insome embodiments, the reference values may be ordered based on eachevent record's timestamp. For example, if numbers are used asidentifiers, they may be sorted so event records having a latertimestamp always have a lower valued identifier than event records withan earlier timestamp, or vice-versa. Reference values are often includedin inverted indexes for retrieving and/or identifying event records.

In one or more embodiments, an inverted index is generated in response to a user-initiated collection query. The term “collection query” as used herein refers to queries that include commands that generate summarization information and inverted indexes (or summarization tables) from event records stored in the field searchable data store.

Note that a collection query is a special type of query that can be user-generated and is used to create an inverted index. A collection query is not the same as a query that is used to call up or invoke a pre-existing inverted index. In one or more embodiments, a query can comprise an initial step that calls up a pre-generated inverted index on which further filtering and processing can be performed. For example, referring back to FIG. 22B, a set of events can be generated at block 2240 by either using a “collection” query to create a new inverted index or by calling up a pre-generated inverted index. A query with several pipelined steps will start with a pre-generated index to accelerate the query.

FIG. 23C illustrates the manner in which an inverted index is createdand used in accordance with the disclosed embodiments. As shown in FIG.23C, an inverted index 2322 can be created in response to auser-initiated collection query using the event data 2323 stored in theraw record data store. For example, a non-limiting example of acollection query may include “collect clientip=127.0.0.1” which mayresult in an inverted index 2322 being generated from the event data2323 as shown in FIG. 23C. Each entry in inverted index 2322 includes anevent reference value that references the location of a source record inthe field searchable data store. The reference value may be used toaccess the original event record directly from the field searchable datastore.

In one or more embodiments, if one or more of the queries is acollection query, the one or more search nodes 506 may generatesummarization information based on the fields of the event recordslocated in the field searchable data store. In at least one of thevarious embodiments, one or more of the fields used in the summarizationinformation may be listed in the collection query and/or they may bedetermined based on terms included in the collection query. For example,a collection query may include an explicit list of fields to summarize.Or, in at least one of the various embodiments, a collection query mayinclude terms or expressions that explicitly define the fields, e.g.,using regex rules. In FIG. 23C, prior to running the collection querythat generates the inverted index 2322, the field name “clientip” mayneed to be defined in a configuration file by specifying the“access_combined” source type and a regular expression rule to parse outthe client IP address. Alternatively, the collection query may containan explicit definition for the field name “clientip” which may obviatethe need to reference the configuration file at search time.
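By way of non-limiting illustration only, applying a regular expression rule that defines the “clientip” field and collecting matching events into inverted index entries can be sketched in Python. The regular expression, the function name collect, the entry layout, the sample raw events, and the numeric reference values are all hypothetical; they do not reproduce the contents of FIG. 23C or of any configuration file of the described system.

```python
import re
from typing import Dict, List

# Assumed extraction rule: a leading dotted-quad address at the start of an access-log line.
CLIENTIP_RULE = re.compile(r"^(?P<clientip>\d{1,3}(?:\.\d{1,3}){3})\s")


def collect(raw_events: Dict[int, str], field: str, rule: re.Pattern, value: str) -> List[Dict[str, object]]:
    """Build inverted index entries for events whose extracted field equals the value."""
    entries: List[Dict[str, object]] = []
    for reference, raw in sorted(raw_events.items()):
        match = rule.match(raw)
        if match and match.group(field) == value:
            # Each entry keeps a reference value pointing back at the source record.
            entries.append({"field": field, "value": value, "reference": reference})
    return entries


# Assumed raw event records keyed by hypothetical reference values.
raw_events = {
    55000: '127.0.0.1 - frank "GET /apache.gif HTTP/1.0" 200 2326',
    55001: '127.0.0.1 - bob "GET /mickey_mouse.gif HTTP/1.0" 200 2980',
    55002: '10.0.1.2 - carlos "GET /static/logo.png HTTP/1.0" 200 1200',
    55003: '127.0.0.1 - alice "GET /index.html HTTP/1.0" 404 512',
}

inverted_index = collect(raw_events, "clientip", CLIENTIP_RULE, "127.0.0.1")
```

Counting the entries of inverted_index in this sketch answers a count query for the collected field-value pair without rescanning the raw events.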

In one or more embodiments, collection queries may be saved and scheduled to run periodically. These scheduled collection queries may periodically update the summarization information corresponding to the query. For example, if the collection query that generates inverted index 2322 is scheduled to run periodically, one or more search nodes 506 can periodically search through the relevant buckets to update inverted index 2322 with event data for any new events with the “clientip” value of “127.0.0.1.”

In some embodiments, the inverted indexes that include fields, values, and reference values (e.g., inverted index 2322) for event records may be included in the summarization information provided to the user. In other embodiments, a user may not be interested in specific fields and values contained in the inverted index, but may need to perform a statistical query on the data in the inverted index. For example, referencing the example of FIG. 23C, rather than viewing the fields within the inverted index 2322, a user may want to generate a count of all client requests from IP address “127.0.0.1.” In this case, the query system 214 can simply return a result of “4” rather than including details about the inverted index 2322 in the information provided to the user.

The pipelined search language, e.g., SPL of the SPLUNK® ENTERPRISE system, can be used to pipe the contents of an inverted index to a statistical query using the “stats” command, for example. A “stats” query refers to a query that generates result sets that may produce aggregate and statistical results from event records, e.g., average, mean, max, min, rms, etc. Where sufficient information is available in an inverted index, a “stats” query may generate its result sets rapidly from the summarization information available in the inverted index rather than directly scanning event records. For example, the contents of inverted index 2322 can be pipelined to a stats query, e.g., a “count” function that counts the number of entries in the inverted index and returns a value of “4.” In this way, inverted indexes may enable various stats queries to be performed absent scanning or searching the event records. Accordingly, this optimization technique enables the system to quickly process queries that seek to determine how many events have a particular value for a particular field. To this end, the system can examine the entry in the inverted index to count instances of the specific value in the field without having to go through the individual events or perform data extractions at search time.

In some embodiments, the system maintains a separate inverted index for each of the above-described time-specific buckets that stores events for a specific time range. A bucket-specific inverted index includes entries for specific field-value combinations that occur in events in the specific bucket. Alternatively, the system can maintain a separate inverted index for one or more data stores 218 of common storage 216, an indexing node 404, or a search node 506. The specific inverted indexes can include entries for the events in the one or more data stores 218 or data store associated with the indexing nodes 404 or search node 506. In some embodiments, if one or more of the queries is a stats query, a search node 506 can generate a partial result set from previously generated summarization information. The partial result sets may be returned to the search head 504 that received the query and combined into a single result set for the query.

As mentioned above, the inverted index can be populated by running a periodic query that scans a set of events to find instances of a specific field-value combination, or alternatively instances of all field-value combinations for a specific field. A periodic query can be initiated by a user, or can be scheduled to occur automatically at specific time intervals. A periodic query can also be automatically launched in response to a query that asks for a specific field-value combination. In some embodiments, if summarization information is absent from a search node 506 that includes responsive event records, further actions may be taken: for example, the summarization information may be generated on the fly, warnings may be provided to the user, the collection query operation may be halted, the absence of summarization information may be ignored, or the like, or a combination thereof.

In one or more embodiments, an inverted index may be set up to update continually. For example, the query may ask for the inverted index to update its result periodically, e.g., every hour. In such instances, the inverted index may be a dynamic data structure that is regularly updated to include information regarding incoming events.

4.10.3.1. Extracting Event Data Using Posting

In one or more embodiments, if the system needs to process all events that have a specific field-value combination, the system can use the references in the inverted index entry to directly access the events to extract further information without having to search all of the events to find the specific field-value combination at search time. In other words, the system can use the reference values to locate the associated event data in the field searchable data store and extract further information from those events, e.g., extract further field values from the events for purposes of filtering or processing or both.

The information extracted from the event data using the reference values can be directed for further filtering or processing in a query using the pipelined search language. The pipelined search language will, in one embodiment, include syntax that can direct the initial filtering step in a query to an inverted index. In one embodiment, a user would include syntax in the query that explicitly directs the initial searching or filtering step to the inverted index.

Referencing the example in FIG. 31, if the user determines that she needs the user id fields associated with the client requests from IP address “127.0.0.1,” instead of incurring the computational overhead of performing a brand new search or re-generating the inverted index with an additional field, the user can generate a query that explicitly directs or pipes the contents of the already generated inverted index 2322 to another filtering step requesting the user ids for the entries in inverted index 2322 where the server response time is greater than “0.0900” microseconds. The query system 214 can use the reference values stored in inverted index 2322 to retrieve the event data from the field searchable data store, filter the results based on the “response time” field values and, further, extract the user id field from the resulting event data to return to the user. In the present instance, the user ids “frank” and “carlos” would be returned to the user from the generated results table 2325.

In one embodiment, the same methodology can be used to pipe the contents of the inverted index to a processing step. In other words, the user is able to use the inverted index to efficiently and quickly perform aggregate functions on field values that were not part of the initially generated inverted index. For example, a user may want to determine an average object size (size of the requested gif) requested by clients from IP address “127.0.0.1.” In this case, the query system 214 can again use the reference values stored in inverted index 2322 to retrieve the event data from the field searchable data store and, further, extract the object size field values from the associated events 2331, 2332, 2333 and 2334. Once the corresponding object sizes have been extracted (i.e., 2326, 2900, 2920, and 5000), the average can be computed and returned to the user.
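The average object size example can be sketched as follows. The event store layout, the field name `object_size`, the pairing of each object size with a particular event number, and the extra event 2335 are all assumptions introduced for illustration; only the four object sizes and the four event reference numbers come from the example above.

```python
from statistics import mean

# Hypothetical field searchable data store keyed by event reference value.
event_store = {
    2331: {"clientip": "127.0.0.1", "object_size": 2326},
    2332: {"clientip": "127.0.0.1", "object_size": 2900},
    2333: {"clientip": "127.0.0.1", "object_size": 2920},
    2334: {"clientip": "127.0.0.1", "object_size": 5000},
    2335: {"clientip": "10.0.0.7", "object_size": 100},  # not referenced by the index entry
}

# Reference values taken from the inverted index entry for clientip=127.0.0.1.
references = [2331, 2332, 2333, 2334]

def average_field(refs, store, field):
    """Follow the reference values and average a field that is not in the index."""
    return mean(store[ref][field] for ref in refs)

avg_size = average_field(references, event_store, "object_size")
print(avg_size)  # 3286.5
```

Only the four referenced events are read; the unrelated event 2335 is never touched, which is the saving the passage describes.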

In one embodiment, instead of explicitly invoking the inverted index in a user-generated query, e.g., by the use of special commands or syntax, the SPLUNK® ENTERPRISE system can be configured to automatically determine if any prior-generated inverted index can be used to expedite a user query. For example, the user's query may request the average object size (size of the requested gif) requested by clients from IP address “127.0.0.1” without any reference to or use of inverted index 2322. The query system 214, in this case, can automatically determine that an inverted index 2322 already exists in the system that could expedite this query. In one embodiment, prior to running any search comprising a field-value pair, for example, a query system 214 can search through all the existing inverted indexes to determine if a pre-generated inverted index could be used to expedite the search comprising the field-value pair. Accordingly, the query system 214 can automatically use the pre-generated inverted index, e.g., index 2322, to generate the results without any user involvement that directs the use of the index.

Using the reference values in an inverted index to be able to directly access the event data in the field searchable data store and extract further information from the associated event data for further filtering and processing is highly advantageous because it avoids incurring the computation overhead of regenerating the inverted index with additional fields or performing a new search.

The data intake and query system 108 includes an intake system 210 that receives data from a variety of input data sources, and an indexing system 212 that processes and stores the data in one or more data stores or common storage 216. By distributing events among the data stores 218 of common storage 216, the query system 214 can analyze events for a query in parallel. In some embodiments, the data intake and query system 108 can maintain a separate and respective inverted index for each of the above-described time-specific buckets that stores events for a specific time range. A bucket-specific inverted index includes entries for specific field-value combinations that occur in events in the specific bucket. As explained above, a search head 504 can correlate and synthesize data from across the various buckets and search nodes 506.

This feature advantageously expedites searches because instead of performing a computationally intensive search in a centrally located inverted index that catalogues all the relevant events, a search node 506 is able to directly search an inverted index stored in a bucket associated with the time-range specified in the query. This allows the search to be performed in parallel across the various search nodes 506. Further, if the query requests further filtering or processing to be conducted on the event data referenced by the locally stored bucket-specific inverted index, the search node 506 is able to simply access the event records stored in the associated bucket for further filtering and processing instead of needing to access a central repository of event records, which would dramatically add to the computational overhead.

In one embodiment, there may be multiple buckets associated with the time-range specified in a query. If the query is directed to an inverted index, or if the query system 214 automatically determines that using an inverted index can expedite the processing of the query, the search nodes 506 can search through each of the inverted indexes associated with the buckets for the specified time-range. This feature allows the High Performance Analytics Store to be scaled easily.
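The bucket-by-bucket search described above can be illustrated with a short sketch. The bucket layout, the half-open time ranges, the use of a thread pool to stand in for search nodes 506, and all names below are assumptions for illustration, not the system's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical buckets, each covering a half-open time range [start, end)
# and holding its own bucket-specific inverted index.
buckets = [
    {"start": 0, "end": 100, "index": {("clientip", "127.0.0.1"): [1, 2]}},
    {"start": 100, "end": 200, "index": {("clientip", "127.0.0.1"): [3]}},
    {"start": 200, "end": 300, "index": {("clientip", "127.0.0.1"): [4, 5]}},
]

def overlaps(bucket, start, end):
    """True if the bucket's time range intersects the query's time range."""
    return bucket["start"] < end and start < bucket["end"]

def search_bucket(bucket, field, value):
    """Work done by one search node: consult only the local bucket's index."""
    return bucket["index"].get((field, value), [])

def search(buckets, field, value, start, end):
    """Select buckets by time range, search them in parallel, merge the partial results."""
    selected = [b for b in buckets if overlaps(b, start, end)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda b: search_bucket(b, field, value), selected))
    # Merging step, analogous to the search head combining partial result sets.
    return [ref for partial in partials for ref in partial]

refs = search(buckets, "clientip", "127.0.0.1", 50, 150)
print(refs)  # [1, 2, 3]
```

Because each bucket carries its own index, adding buckets adds independent units of work rather than growing one central index, which is one way to read the scaling remark above.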

FIG. 23D is a flow diagram illustrating an embodiment of a routine implemented by one or more computing devices of the data intake and query system for using an inverted index in a pipelined search query to determine a set of event data that can be further limited by filtering or processing. For example, the routine can be implemented by any one or any combination of the search head 504, search node 506, search master 512, or search manager 514, etc. However, for simplicity, reference below is made to the query system 214 performing the various steps of the routine.

At block 2342, a query is received by a data intake and query system 108. In some embodiments, the query can be received as a user generated query entered into a search bar of a graphical user search interface. The search interface also includes a time range control element that enables specification of a time range for the query.

At block 2344, an inverted index is retrieved. Note that the inverted index can be retrieved in response to an explicit user search command inputted as part of the user generated query. Alternatively, a query system 214 can be configured to automatically use an inverted index if it determines that using the inverted index would expedite the servicing of the user generated query. Each of the entries in an inverted index keeps track of instances of a specific value in a specific field in the event data and includes references to events containing the specific value in the specific field. In order to expedite queries, in some embodiments, the query system 214 employs the inverted index separate from the raw record data store to generate responses to the received queries.

At block 2346, the query system 214 determines if the query contains further filtering and processing steps. If the query contains no further commands, then, in one embodiment, summarization information can be provided to the user at block 2354.

If, however, the query does contain further filtering and processing commands, then at block 2348, the query system 214 determines if the commands relate to further filtering or processing of the data extracted as part of the inverted index or whether the commands are directed to using the inverted index as an initial filtering step to further filter and process event data referenced by the entries in the inverted index. If the query can be completed using data already in the generated inverted index, then the further filtering or processing steps, e.g., a “count” number of records function, “average” number of records per hour, etc., are performed and the results are provided to the user at block 2350.

If, however, the query references fields that are not extracted in the inverted index, the query system 214 can access event data pointed to by the reference values in the inverted index to retrieve any further information required at block 2356. Subsequently, any further filtering or processing steps are performed on the fields extracted directly from the event data and the results are provided to the user at block 2358.
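The branching in blocks 2346 through 2358 can be summarized in a short sketch. The function `run_query`, the `(function, field)` command tuple, and the data layout are hypothetical simplifications; the returned block number is included only to show which branch of the routine was taken.

```python
from statistics import mean

# Hypothetical index entry and event store, for illustration only.
entry = {"field": "clientip", "value": "127.0.0.1", "refs": [2331, 2332, 2333, 2334]}
event_store = {
    2331: {"object_size": 2326},
    2332: {"object_size": 2900},
    2333: {"object_size": 2920},
    2334: {"object_size": 5000},
}

def run_query(entry, event_store, command=None):
    """Return (block, result) for a query piped from an inverted index entry.

    command is None, or a (function, field) tuple such as ("avg", "object_size").
    """
    refs = entry["refs"]
    if command is None:
        # Block 2346 -> block 2354: no further commands, return summarization info.
        summary = {"field": entry["field"], "value": entry["value"], "count": len(refs)}
        return 2354, summary
    func, field = command
    if func == "count" and field in (None, entry["field"]):
        # Block 2348 -> block 2350: answerable from data already in the index.
        return 2350, len(refs)
    # Block 2356: the field is not in the index, so follow the reference values.
    values = [event_store[ref][field] for ref in refs]
    # Block 2358: process the fields extracted directly from the event data.
    if func == "count":
        return 2358, len(values)
    if func == "avg":
        return 2358, mean(values)
    raise ValueError(f"unsupported function: {func}")

print(run_query(entry, event_store))                          # branch ending at block 2354
print(run_query(entry, event_store, ("count", None)))         # (2350, 4)
print(run_query(entry, event_store, ("avg", "object_size")))  # (2358, 3286.5)
```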

4.10.4. Accelerating Report Generation

In some embodiments, a data server system such as the data intake and query system 108 can accelerate the process of periodically generating updated reports based on query results. To accelerate this process, a summarization engine can automatically examine the query to determine whether generation of updated reports can be accelerated by creating intermediate summaries. If reports can be accelerated, the summarization engine periodically generates a summary covering data obtained during a latest non-overlapping time period. For example, where the query seeks events meeting specified criteria, a summary for the time period may include only events within the time period that meet the specified criteria. Similarly, if the query seeks statistics calculated from the events, such as the number of events that match the specified criteria, then the summary for the time period includes the number of events in the period that match the specified criteria.

In addition to the creation of the summaries, the summarization engine schedules the periodic updating of the report associated with the query. During each scheduled report update, the query system 214 determines whether intermediate summaries have been generated covering portions of the time period covered by the report update. If so, then the report is generated based on the information contained in the summaries. Also, if additional event data has been received and has not yet been summarized, and is required to generate the complete report, the query can be run on these additional events. Then, the results returned by this query on the additional events, along with the partial results obtained from the intermediate summaries, can be combined to generate the updated report. This process is repeated each time the report is updated. Alternatively, if the system stores events in buckets covering specific time ranges, then the summaries can be generated on a bucket-by-bucket basis. Note that producing intermediate summaries can save the work involved in re-running the query for previous time periods, so advantageously only the newer events need to be processed while generating an updated report. These report acceleration techniques are described in more detail in U.S. Pat. No. 8,589,403, entitled “COMPRESSED JOURNALING IN EVENT TRACKING FILES FOR METADATA RECOVERY AND REPLICATION”, issued on 19 Nov. 2013, U.S. Pat. No. 8,412,696, entitled “REAL TIME SEARCHING AND REPORTING”, issued on 2 Apr. 2011, and U.S. Pat. Nos. 8,589,375 and 8,589,432, both also entitled “REAL TIME SEARCHING AND REPORTING”, both issued on 19 Nov. 2013, each of which is hereby incorporated by reference in its entirety for all purposes.
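The combination of intermediate summaries with a query over not-yet-summarized events can be sketched as follows for a simple count statistic. The event fields, the matching criterion (`status == 500`), and the function names are assumptions chosen for illustration; the passage above does not prescribe them.

```python
# Hypothetical events with a timestamp and a status field.
events = [
    {"time": 5, "status": 500},
    {"time": 15, "status": 200},
    {"time": 25, "status": 500},
    {"time": 35, "status": 500},
    {"time": 45, "status": 500},  # arrived after the last summary was built
]

def matches(event):
    """Assumed query criterion: server errors only."""
    return event["status"] == 500

def summarize(events, start, end):
    """Build an intermediate summary (a match count) for the period [start, end)."""
    count = sum(1 for e in events if start <= e["time"] < end and matches(e))
    return {"start": start, "end": end, "count": count}

def updated_report(summaries, events, report_end):
    """Combine summary counts with a query over events not yet summarized."""
    summarized_until = max((s["end"] for s in summaries), default=0)
    partial = sum(s["count"] for s in summaries)
    fresh = sum(
        1 for e in events
        if summarized_until <= e["time"] < report_end and matches(e)
    )
    return partial + fresh

# Summaries cover two non-overlapping periods; the event at time 45 is unsummarized.
summaries = [summarize(events, 0, 20), summarize(events, 20, 40)]
report = updated_report(summaries, events, report_end=50)
print(report)  # 4
```

In this sketch only the events at or after time 40 are re-examined at report time; the earlier periods contribute through their stored counts.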

4.12. Security Features

The data intake and query system 108 provides various schemas, dashboards, and visualizations that simplify developers' tasks to create applications with additional capabilities. One such application is an enterprise security application, such as SPLUNK® ENTERPRISE SECURITY, which performs monitoring and alerting operations and includes analytics to facilitate identifying both known and unknown security threats based on large volumes of data stored by the data intake and query system 108. The enterprise security application provides the security practitioner with visibility into security-relevant threats found in the enterprise infrastructure by capturing, monitoring, and reporting on data from enterprise security devices, systems, and applications. Through the use of the data intake and query system 108 searching and reporting capabilities, the enterprise security application provides a top-down and bottom-up view of an organization's security posture.

The enterprise security application leverages the data intake and query system 108 search-time normalization techniques, saved searches, and correlation searches to provide visibility into security-relevant threats and activity and generate notable events for tracking. The enterprise security application enables the security practitioner to investigate and explore the data to find new or unknown threats that do not follow signature-based patterns.

Conventional Security Information and Event Management (SIEM) systems lack the infrastructure to effectively store and analyze large volumes of security-related data. Traditional SIEM systems typically use fixed schemas to extract data from pre-defined security-related fields at data ingestion time and store the extracted data in a relational database. This traditional data extraction process (and associated reduction in data size) that occurs at data ingestion time inevitably hampers future incident investigations that may need original data to determine the root cause of a security issue, or to detect the onset of an impending security threat.

In contrast, the enterprise security application system stores large volumes of minimally-processed security-related data at ingestion time for later retrieval and analysis at search time when a live security threat is being investigated. To facilitate this data retrieval process, the enterprise security application provides pre-specified schemas for extracting relevant values from the different types of security-related events and enables a user to define such schemas.

The enterprise security application can process many types of security-related information. In general, this security-related information can include any information that can be used to identify security threats. For example, the security-related information can include network-related information, such as IP addresses, domain names, asset identifiers, network traffic volume, uniform resource locator strings, and source addresses. The process of detecting security threats for network-related information is further described in U.S. Pat. No. 8,826,434, entitled “SECURITY THREAT DETECTION BASED ON INDICATIONS IN BIG DATA OF ACCESS TO NEWLY REGISTERED DOMAINS”, issued on 2 Sep. 2014, U.S. Pat. No. 9,215,240, entitled “INVESTIGATIVE AND DYNAMIC DETECTION OF POTENTIAL SECURITY-THREAT INDICATORS FROM EVENTS IN BIG DATA”, issued on 15 Dec. 2015, U.S. Pat. No. 9,173,801, entitled “GRAPHIC DISPLAY OF SECURITY THREATS BASED ON INDICATIONS OF ACCESS TO NEWLY REGISTERED DOMAINS”, issued on 3 Nov. 2015, U.S. Pat. No. 9,248,068, entitled “SECURITY THREAT DETECTION OF NEWLY REGISTERED DOMAINS”, issued on 2 Feb. 2016, U.S. Pat. No. 9,426,172, entitled “SECURITY THREAT DETECTION USING DOMAIN NAME ACCESSES”, issued on 23 Aug. 2016, and U.S. Pat. No. 9,432,396, entitled “SECURITY THREAT DETECTION USING DOMAIN NAME REGISTRATIONS”, issued on 30 Aug. 2016, each of which is hereby incorporated by reference in its entirety for all purposes. Security-related information can also include malware infection data and system configuration information, as well as access control information, such as login/logout information and access failure notifications. The security-related information can originate from various sources within a data center, such as hosts, virtual machines, storage devices and sensors. The security-related information can also originate from various sources in a network, such as routers, switches, email servers, proxy servers, gateways, firewalls and intrusion-detection systems.

During operation, the enterprise security application facilitates detecting “notable events” that are likely to indicate a security threat. A notable event represents one or more anomalous incidents, the occurrence of which can be identified based on one or more events (e.g., time stamped portions of raw machine data) fulfilling pre-specified and/or dynamically-determined (e.g., based on machine-learning) criteria defined for that notable event. Examples of notable events include the repeated occurrence of an abnormal spike in network usage over a period of time, a single occurrence of unauthorized access to a system, a host communicating with a server on a known threat list, and the like. These notable events can be detected in a number of ways, such as: (1) a user can notice a correlation in events and can manually identify that a corresponding group of one or more events amounts to a notable event; or (2) a user can define a “correlation search” specifying criteria for a notable event, and every time one or more events satisfy the criteria, the application can indicate that the one or more events correspond to a notable event; and the like. A user can alternatively select a pre-defined correlation search provided by the application. Note that correlation searches can be run continuously or at regular intervals (e.g., every hour) to search for notable events. Upon detection, notable events can be stored in a dedicated “notable events index,” which can be subsequently accessed to generate various visualizations containing security-related information. Also, alerts can be generated to notify system operators when important notable events are discovered.
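A correlation search of the kind described above can be sketched as a function that scans events for a criterion and appends any matches to a notable events index. The criterion used here (a host with at least three login failures), the event fields, and all names are hypothetical and chosen only to make the example concrete.

```python
from collections import Counter

# Hypothetical security events.
events = [
    {"host": "web01", "action": "login_failure"},
    {"host": "web01", "action": "login_failure"},
    {"host": "db01", "action": "login_failure"},
    {"host": "web01", "action": "login_failure"},
    {"host": "web01", "action": "login_success"},
]

def correlation_search(events, threshold):
    """Return one notable event per host whose failure count meets the threshold."""
    failures = Counter(e["host"] for e in events if e["action"] == "login_failure")
    return [
        {"host": host, "failures": n}
        for host, n in sorted(failures.items())
        if n >= threshold
    ]

# Dedicated store for detected notable events; a plain list stands in for the index.
notable_events_index = []
notable_events_index.extend(correlation_search(events, threshold=3))
print(notable_events_index)  # [{'host': 'web01', 'failures': 3}]
```

Running the same function on a schedule (for example, hourly) over newly arrived events would correspond to the interval-based execution mentioned above.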

The enterprise security application provides various visualizations to aid in discovering security threats, such as a “key indicators view” that enables a user to view security metrics, such as counts of different types of notable events. For example, FIG. 33A illustrates an example key indicators view 3300 that comprises a dashboard, which can display a value 3301 for various security-related metrics, such as malware infections 3302. It can also display a change in a metric value 3303, which indicates that the number of malware infections increased by 63 during the preceding interval. Key indicators view 3300 additionally displays a histogram panel 3304 that displays a histogram of notable events organized by urgency values, and a histogram of notable events organized by time intervals. This key indicators view is described in further detail in pending U.S. patent application Ser. No. 13/956,338, entitled “KEY INDICATORS VIEW”, filed on 31 Jul. 2013, and which is hereby incorporated by reference in its entirety for all purposes.

These visualizations can also include an “incident review dashboard” that enables a user to view and act on “notable events.” These notable events can include: (1) a single event of high importance, such as any activity from a known web attacker; or (2) multiple events that collectively warrant review, such as a large number of authentication failures on a host followed by a successful authentication. For example, FIG. 33B illustrates an example incident review dashboard 3310 that includes a set of incident attribute fields 3311 that, for example, enables a user to specify a time range field 3312 for the displayed events. It also includes a timeline 3313 that graphically illustrates the number of incidents that occurred in time intervals over the selected time range. It additionally displays an events list 3314 that enables a user to view a list of all of the notable events that match the criteria in the incident attribute fields 3311. To facilitate identifying patterns among the notable events, each notable event can be associated with an urgency value (e.g., low, medium, high, critical), which is indicated in the incident review dashboard. The urgency value for a detected event can be determined based on the severity of the event and the priority of the system component associated with the event.

4.13. Data Center Monitoring

As mentioned above, the data intake and query platform provides various features that simplify the developer's task to create various applications. One such application is a virtual machine monitoring application, such as SPLUNK® APP FOR VMWARE®, that provides operational visibility into granular performance metrics, logs, tasks and events, and topology from hosts, virtual machines and virtual centers. It empowers administrators with an accurate real-time picture of the health of the environment, proactively identifying performance and capacity bottlenecks.

Conventional data-center-monitoring systems lack the infrastructure to effectively store and analyze large volumes of machine-generated data, such as performance information and log data obtained from the data center. In conventional data-center-monitoring systems, machine-generated data is typically pre-processed prior to being stored, for example, by extracting pre-specified data items and storing them in a database to facilitate subsequent retrieval and analysis at search time. However, the rest of the data is not saved and is discarded during pre-processing.

In contrast, the virtual machine monitoring application stores large volumes of minimally processed machine data, such as performance information and log data, at ingestion time for later retrieval and analysis at search time when a live performance issue is being investigated. In addition to data obtained from various log files, this performance-related information can include values for performance metrics obtained through an application programming interface (API) provided as part of the vSphere Hypervisor™ system distributed by VMware, Inc. of Palo Alto, Calif. For example, these performance metrics can include: (1) CPU-related performance metrics; (2) disk-related performance metrics; (3) memory-related performance metrics; (4) network-related performance metrics; (5) energy-usage statistics; (6) data-traffic-related performance metrics; (7) overall system availability performance metrics; (8) cluster-related performance metrics; and (9) virtual machine performance statistics. Such performance metrics are described in U.S. patent application Ser. No. 14/167,316, entitled “CORRELATION FOR USER-SELECTED TIME RANGES OF VALUES FOR PERFORMANCE METRICS OF COMPONENTS IN AN INFORMATION-TECHNOLOGY ENVIRONMENT WITH LOG DATA FROM THAT INFORMATION-TECHNOLOGY ENVIRONMENT”, filed on 29 Jan. 2014, and which is hereby incorporated by reference in its entirety for all purposes.

To facilitate retrieving information of interest from performance data and log files, the virtual machine monitoring application provides pre-specified schemas for extracting relevant values from different types of performance-related events, and also enables a user to define such schemas.

The virtual machine monitoring application additionally provides various visualizations to facilitate detecting and diagnosing the root cause of performance problems. For example, one such visualization is a “proactive monitoring tree” that enables a user to easily view and understand relationships among various factors that affect the performance of a hierarchically structured computing system. This proactive monitoring tree enables a user to easily navigate the hierarchy by selectively expanding nodes representing various entities (e.g., virtual centers or computing clusters) to view performance information for lower-level nodes associated with lower-level entities (e.g., virtual machines or host systems). Example node-expansion operations are illustrated in FIG. 33C, wherein nodes 3333 and 3334 are selectively expanded. Note that nodes 3331-3339 can be displayed using different patterns or colors to represent different performance states, such as a critical state, a warning state, a normal state or an unknown/offline state. The ease of navigation provided by selective expansion in combination with the associated performance-state information enables a user to quickly diagnose the root cause of a performance problem. The proactive monitoring tree is described in further detail in U.S. Pat. No. 9,185,007, entitled “PROACTIVE MONITORING TREE WITH SEVERITY STATE SORTING”, issued on 10 Nov. 2015, and U.S. Pat. No. 9,426,045, also entitled “PROACTIVE MONITORING TREE WITH SEVERITY STATE SORTING”, issued on 23 Aug. 2016, each of which is hereby incorporated by reference in its entirety for all purposes.

The virtual machine monitoring application also provides a user interface that enables a user to select a specific time range and then view heterogeneous data comprising events, log data, and associated performance metrics for the selected time range. For example, the screen illustrated in FIG. 33D displays a listing of recent “tasks and events” and a listing of recent “log entries” for a selected time range above a performance-metric graph for “average CPU core utilization” for the selected time range. Note that a user is able to operate pull-down menus 3342 to selectively display different performance metric graphs for the selected time range. This enables the user to correlate trends in the performance-metric graph with corresponding event and log data to quickly determine the root cause of a performance problem. This user interface is described in more detail in U.S. patent application Ser. No. 14/167,316, entitled “CORRELATION FOR USER-SELECTED TIME RANGES OF VALUES FOR PERFORMANCE METRICS OF COMPONENTS IN AN INFORMATION-TECHNOLOGY ENVIRONMENT WITH LOG DATA FROM THAT INFORMATION-TECHNOLOGY ENVIRONMENT”, filed on 29 Jan. 2014, and which is hereby incorporated by reference in its entirety for all purposes.

4.14. IT Service Monitoring

As previously mentioned, the data intake and query platform provides various schemas, dashboards and visualizations that make it easy for developers to create applications to provide additional capabilities. One such application is an IT monitoring application, such as SPLUNK® IT SERVICE INTELLIGENCE™, which performs monitoring and alerting operations. The IT monitoring application also includes analytics to help an analyst diagnose the root cause of performance problems based on large volumes of data stored by the data intake and query system 108 as correlated to the various services an IT organization provides (a service-centric view). This differs significantly from conventional IT monitoring systems that lack the infrastructure to effectively store and analyze large volumes of service-related events. Traditional service monitoring systems typically use fixed schemas to extract data from pre-defined fields at data ingestion time, wherein the extracted data is typically stored in a relational database. This data extraction process and associated reduction in data content that occurs at data ingestion time inevitably hampers future investigations, when all of the original data may be needed to determine the root cause of or contributing factors to a service issue.

In contrast, an IT monitoring application system stores large volumes of minimally-processed service-related data at ingestion time for later retrieval and analysis at search time, to perform regular monitoring, or to investigate a service issue. To facilitate this data retrieval process, the IT monitoring application enables a user to define an IT operations infrastructure from the perspective of the services it provides. In this service-centric approach, a service such as corporate e-mail may be defined in terms of the entities employed to provide the service, such as host machines and network devices. Each entity is defined to include information for identifying all of the events that pertain to the entity, whether produced by the entity itself or by another machine, and considering the many various ways the entity may be identified in machine data (such as by a URL, an IP address, or machine name). The service and entity definitions can organize events around a service so that all of the events pertaining to that service can be easily identified. This capability provides a foundation for the implementation of Key Performance Indicators.

One or more Key Performance Indicators (KPI's) are defined for a service within the IT monitoring application. Each KPI measures an aspect of service performance at a point in time or over a period of time (aspect KPI's). Each KPI is defined by a search query that derives a KPI value from the machine data of events associated with the entities that provide the service. Information in the entity definitions may be used to identify the appropriate events at the time a KPI is defined or whenever a KPI value is being determined. The KPI values derived over time may be stored to build a valuable repository of current and historical performance information for the service, and the repository, itself, may be subject to search query processing. Aggregate KPI's may be defined to provide a measure of service performance calculated from a set of service aspect KPI values; this aggregate may even be taken across defined timeframes and/or across multiple services. A particular service may have an aggregate KPI derived from substantially all of the aspect KPI's of the service to indicate an overall health score for the service.

The IT monitoring application facilitates the production of meaningful aggregate KPI's through a system of KPI thresholds and state values. Different KPI definitions may produce values in different ranges, and so the same value may mean something very different from one KPI definition to another. To address this, the IT monitoring application implements a translation of individual KPI values to a common domain of “state” values. For example, a KPI range of values may be 1-100, or 50-275, while values in the state domain may be ‘critical,’ ‘warning,’ ‘normal,’ and ‘informational’. Thresholds associated with a particular KPI definition determine ranges of values for that KPI that correspond to the various state values. In one case, KPI values 95-100 may be set to correspond to ‘critical’ in the state domain. KPI values from disparate KPI's can be processed uniformly once they are translated into the common state values using the thresholds. For example, “normal 80% of the time” can be applied across various KPI's. To provide meaningful aggregate KPI's, a weighting value can be assigned to each KPI so that its influence on the calculated aggregate KPI value is increased or decreased relative to the other KPI's.
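The threshold-based translation and weighting described above can be sketched as follows. Only the 1-100 and 50-275 ranges, the state names, and the 95-100 ‘critical’ band come from the text; the remaining threshold values, the numeric score assigned to each state, and the use of a weighted average as the aggregation rule are assumptions made for illustration.

```python
from bisect import bisect_right

def to_state(value, thresholds):
    """Map a KPI value to a state using (lower_bound, state) pairs sorted ascending."""
    bounds = [bound for bound, _ in thresholds]
    i = bisect_right(bounds, value) - 1
    if i < 0:
        raise ValueError(f"value {value} is below the lowest threshold")
    return thresholds[i][1]

# Per-KPI thresholds. Only the 95-100 'critical' band is taken from the text.
cpu_thresholds = [(1, "normal"), (80, "warning"), (95, "critical")]          # KPI range 1-100
latency_thresholds = [(50, "normal"), (200, "warning"), (250, "critical")]   # KPI range 50-275

# Assumed numeric score per state so that states can be combined arithmetically.
STATE_SCORE = {"informational": 0, "normal": 1, "warning": 2, "critical": 3}

def aggregate(kpis):
    """Weighted average of state scores; kpis is a list of (state, weight) pairs."""
    total_weight = sum(weight for _, weight in kpis)
    return sum(STATE_SCORE[state] * weight for state, weight in kpis) / total_weight

cpu_state = to_state(97, cpu_thresholds)            # 'critical'
latency_state = to_state(120, latency_thresholds)   # 'normal'
health = aggregate([(cpu_state, 2), (latency_state, 1)])
print(cpu_state, latency_state, round(health, 3))   # critical normal 2.333
```

Because both KPIs are first mapped into the same state domain, the aggregation never compares a raw CPU percentage with a raw latency value; the weight of 2 on the CPU KPI simply makes its state count twice as much.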

One service in an IT environment often impacts, or is impacted by, another service. The IT monitoring application can reflect these dependencies. For example, a dependency relationship between a corporate e-mail service and a centralized authentication service can be reflected by recording an association between their respective service definitions. The recorded associations establish a service dependency topology that informs the data or selection options presented in a GUI, for example. (The service dependency topology is like a “map” showing how services are connected based on their dependencies.) The service topology may itself be depicted in a GUI and may be interactive to allow navigation among related services.

Entity definitions in the IT monitoring application can include informational fields that can serve as metadata, implied data fields, or attributed data fields for the events identified by other aspects of the entity definition. Entity definitions in the IT monitoring application can also be created and updated by an import of tabular data (as represented in a CSV, another delimited file, or a search query result set). The import may be GUI-mediated or processed using import parameters from a GUI-based import definition process. Entity definitions in the IT monitoring application can also be associated with a service by means of a service definition rule. Processing the rule results in the matching entity definitions being associated with the service definition. The rule can be processed at creation time, and thereafter on a scheduled or on-demand basis. This allows dynamic, rule-based updates to the service definition.

During operation, the IT monitoring application can recognize notable events that may indicate a service performance problem or other situation of interest. These notable events can be recognized by a “correlation search” specifying trigger criteria for a notable event: every time KPI values satisfy the criteria, the application indicates a notable event. A severity level for the notable event may also be specified. Furthermore, when trigger criteria are satisfied, the correlation search may additionally or alternatively cause a service ticket to be created in an IT service management (ITSM) system, such as a system available from ServiceNow, Inc., of Santa Clara, Calif.

SPLUNK® IT SERVICE INTELLIGENCE™ provides various visualizations built on its service-centric organization of events and the KPI values generated and collected. Visualizations can be particularly useful for monitoring or investigating service performance. The IT monitoring application provides a service monitoring interface suitable as the home page for ongoing IT service monitoring. The interface is appropriate for settings such as desktop use or for a wall-mounted display in a network operations center (NOC). The interface may prominently display a services health section with tiles for the aggregate KPIs indicating overall health for defined services and a general KPI section with tiles for KPIs related to individual service aspects. These tiles may display KPI information in a variety of ways, such as by being colored and ordered according to factors like the KPI state value. They also can be interactive and navigate to visualizations of more detailed KPI information.

The IT monitoring application provides a service-monitoring dashboard visualization based on a user-defined template. The template can include user-selectable widgets of varying types and styles to display KPI information. The content and the appearance of widgets can respond dynamically to changing KPI information. The KPI widgets can appear in conjunction with a background image, user drawing objects, or other visual elements, that depict the IT operations environment, for example. The KPI widgets or other GUI elements can be interactive so as to provide navigation to visualizations of more detailed KPI information.

The IT monitoring application provides a visualization showing detailed time-series information for multiple KPIs in parallel graph lanes. The length of each lane can correspond to a uniform time range, while the width of each lane may be automatically adjusted to fit the displayed KPI data. Data within each lane may be displayed in a user selectable style, such as a line, area, or bar chart. During operation, a user may select a position in the time range of the graph lanes to activate lane inspection at that point in time. Lane inspection may display an indicator for the selected time across the graph lanes and display the KPI value associated with that point in time for each of the graph lanes. The visualization may also provide navigation to an interface for defining a correlation search, using information from the visualization to pre-populate the definition.

The IT monitoring application provides a visualization for incident review showing detailed information for notable events. The incident review visualization may also show summary information for the notable events over a time frame, such as an indication of the number of notable events at each of a number of severity levels. The severity level display may be presented as a rainbow chart with the warmest color associated with the highest severity classification. The incident review visualization may also show summary information for the notable events over a time frame, such as the number of notable events occurring within segments of the time frame. The incident review visualization may display a list of notable events within the time frame ordered by any number of factors, such as time or severity. The selection of a particular notable event from the list may display detailed information about that notable event, including an identification of the correlation search that generated the notable event.

The IT monitoring application provides pre-specified schemas for extracting relevant values from the different types of service-related events. It also enables a user to define such schemas.

4.15. Anomaly Detection

As detailed above, data may be ingested at the data intake and query system 108 through an intake system 210 configured to conduct preliminary processing on the data, and make the data available to downstream systems or components, such as the indexing system 212, query system 214, third party systems, etc. In some cases, there may be errors, anomalies, or other issues with the ingested data. Typically, such errors, anomalies, or other issues may be surfaced by an administrator after the data has been ingested, processed, and made available to downstream systems or components (e.g., after the ingested data has already been indexed and stored in common storage 216, after the ingested data is searchable by the query system 214, etc.). In particular, the errors, anomalies, or other issues may be identified by the administrator when performing a query on historical, stored data. Identifying the errors, anomalies, or other issues at this stage, however, may be too late to resolve the underlying cause of these issues or to prevent such issues from occurring in the future. In fact, these issues may not even be surfaced unless the administrator actively performs a query or otherwise attempts to investigate the characteristics of indexed and stored data.

In other cases, there may be errors, anomalies, or other issues with the data ingestion pipeline itself. For example, the underlying data being ingested may be normal. However, there may be something wrong with the program that is running the data ingestion pipeline. Such issues can include a deployment error (e.g., there is a version mismatch between various components that execute operations to run the data ingestion), the environment restarting (and therefore certain components that execute operations to run the data ingestion being unavailable), a configuration error, components that execute operations to run the data ingestion being swapped with other components such that the swapped-in components are incompatible or cause the existing components to fail, services supporting the components that execute operations to run the data ingestion failing, an authentication mechanism associated with the data ingestion failing, and/or the like.

Typically, an administrator may detect issues with the data ingestion pipeline only by chance via a manual inspection. The administrator can create a rule with hardcoded thresholds (e.g., set parameters) that describe the previously-detected data ingestion pipeline issue such that an alert can be generated if the same data ingestion pipeline issue resurfaces. However, such rules are not capable of detecting new types of data ingestion pipeline issues, such as those that have not been detected before. In addition, a data ingestion pipeline can be present in environments of different sizes and can have a varying number of components. The hardcoded thresholds of a rule, therefore, may not apply to all types of data ingestion pipelines, such as those that have different environment sizes or different data ingestion pipeline components than the data ingestion pipeline from which the rule was originally created.

Finally, even if a data ingestion pipeline issue is identified, the administrator may not know why the issue occurred or what could be done to resolve the issue. An alert may merely provide an administrator with information indicating what issue occurred.

Accordingly, described herein are operations for processing ingested data in an asynchronous manner as the data is being ingested or streamed to detect potential anomalies. For example, the data being ingested may be job manager logs (e.g., job manager logs originating from an APACHE FLINK dataflow engine, where the job manager logs describe events that occurred as a result of a job manager of the APACHE FLINK dataflow engine scheduling tasks, coordinating checkpoints, coordinating recovery on failures, etc.), task manager logs (e.g., task manager logs originating from an APACHE FLINK dataflow engine, where the task manager logs describe events that occurred as a result of a task manager of the APACHE FLINK dataflow engine executing tasks), and/or any other type(s) of application logs (e.g., any Kubernetes logs). One or more of the streaming data processors 308, separate from the streaming data processor(s) 308 configured with one or more data transformation rules to transform messages and republish the messages to one or both of the intake ingestion buffer 306 and the output ingestion buffer 310, can join the job manager and task manager logs (and/or any other type(s) of application logs) as the logs are ingested. For example, the job manager logs and task manager logs may each include a job ID field. The streaming data processor(s) 308 can join the job manager and task manager logs using the job ID field, which correlates data for executed tasks with jobs that scheduled the tasks. Alternatively, the job manager and task manager logs (and/or other type(s) of application logs) may have been joined or combined prior to being ingested by the intake system 210.

The streaming data processor(s) 308 can then convert the joined logs into a comparable data structure (e.g., a string vector), determine whether the comparable data structure should be assigned to an existing data pattern or a new data pattern, and optionally update a characteristic of the data pattern to which the comparable data structure is assigned. The streaming data processor(s) 308 can perform these operations without an administrator first providing a query or otherwise attempting to investigate the characteristics of the ingested data. Thus, an administrator may not need to understand the specific query language used to produce query results. Rather, the streaming data processor(s) 308 can perform these operations automatically in real-time (e.g., as soon as data is ingested or while the data is streamed) or in batches (e.g., periodically every minute, hour, day, week, etc.). Once one or more comparable data structures have been assigned to one or more data patterns, the streaming data processor(s) 308 can analyze the comparable data structures assigned to a particular data pattern to determine whether any of the comparable data structures appear to be anomalous. The streaming data processor(s) 308 or another component of the data intake and query system 108 can then generate user interface data that, when rendered by a client device 204, causes the client device to display a user interface depicting identified patterns in the ingested data, detected anomalies, and/or other corresponding information.

Separately, one or more of the streaming data processors 308 can obtain pipeline metrics describing the operation of the data ingestion pipeline, which can include the forwarder 302, the data retrieval subsystem 304, the intake ingestion buffer 306, other streaming data processor(s) 308 (e.g., streaming data processor(s) 308 other than the streaming data processor(s) 308 being used to detect anomalies in ingested data and/or in the data ingestion pipeline itself, such as the streaming data processor(s) 308 configured with one or more data transformation rules to transform messages and republish the messages to one or both of the intake ingestion buffer 306 and the output ingestion buffer 310), the output ingestion buffer 310, and/or any other component of the intake system 210, not shown. Pipeline metrics can include bytes transferred per second within the data ingestion pipeline, bytes ingested per second within the data ingestion pipeline, bytes outputted per second from the data ingestion pipeline, latency of the data ingestion pipeline, processor usage of some or all of the components within the data ingestion pipeline, memory usage of some or all of the components within the data ingestion pipeline, number of events processed by the data ingestion pipeline over a period of time, and/or the like. Different pipeline metrics corresponding to the same time instant or time period can be ingested. The streaming data processor(s) 308 can perform a multi-variate time-series outlier detection on the ingested pipeline metric(s) to determine an outlier score for the pipeline metric(s).

The streaming data processor(s) 308 can then identify anomalous logs (e.g., based on converting the logs into a comparable data structure, assigning the comparable data structure to a data pattern, and analyzing the comparable data structures assigned to the data pattern, as described above) corresponding to the same time instant or time period as the ingested pipeline metric(s), if present, and combine an anomaly score of the anomalous logs (e.g., which may be a distance between the anomalous logs and a center of a cluster defining the nearest data pattern) with the outlier score to form a combined score. The streaming data processor(s) 308 can apply a certain weight to the anomaly score and a certain weight to the outlier score, and sum the weighted scores to form the combined score. The weights, however, can be adjusted over time based on user feedback that indicates whether the logs were actually anomalous and/or whether the pipeline metrics were actually outliers or anomalous. If the combined score exceeds a threshold, this may indicate that the ingested pipeline metric(s) are truly anomalous and not false positives. Thus, the streaming data processor(s) 308 or another component of the data intake and query system 108 can then generate a user interface or alert that indicates that the ingested pipeline metric(s) are anomalous and use the anomalous logs to explain a reason why the ingested pipeline metric(s) are anomalous.
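
As a non-limiting illustration, the weighted combination of the anomaly score and the outlier score can be sketched as follows. The function names, the default weights, and the default threshold are hypothetical and are chosen only for the example; the manner in which user feedback adjusts the weights is not shown.

```python
def combined_score(anomaly_score, outlier_score, w_anomaly=0.5, w_outlier=0.5):
    """Weighted sum of the log anomaly score and the metric outlier score."""
    return w_anomaly * anomaly_score + w_outlier * outlier_score


def metrics_are_anomalous(anomaly_score, outlier_score, threshold=1.0, **weights):
    """Flag the pipeline metric(s) when the combined score exceeds the threshold."""
    return combined_score(anomaly_score, outlier_score, **weights) > threshold
```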

The architecture of the components that enable the anomaly detection functionality described herein is described below with respect to FIGS. 34A-34C.

4.15.1. Anomaly Detection Architecture

To implement the anomaly detection functionality described herein, the streaming data processor 308 can run various tasks, including a raw data converter 3402, one or more pattern matchers 3404, an anomaly detector 3406, one or more pipeline metric outlier detectors 3408, and an anomalous metric identifier 3410, as shown in FIG. 34A. The raw data converter 3402 can join ingested pieces of data prior to a conversion. For example, the ingested pieces of data can include job manager logs, task manager logs, and/or one or more other types of application logs. Each log may include a job ID field, and the raw data converter 3402 can use the job ID field to join one or more logs (e.g., join logs that have the same job ID), thereby correlating tasks with jobs that caused the tasks to be executed. Alternatively, the job manager logs and the task manager logs (and/or other type(s) of application logs) may have been joined prior to being received by the raw data converter 3402, and therefore the raw data converter 3402 may not perform any join operation.
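
As a non-limiting illustration, a join of job manager logs and task manager logs on a shared job ID field can be sketched as follows. The sketch assumes that each log has already been parsed into a dictionary; the field name "job_id" and the function name are hypothetical.

```python
from collections import defaultdict


def join_logs_by_job_id(job_manager_logs, task_manager_logs, key="job_id"):
    """Group both kinds of logs under the job ID they share."""
    joined = defaultdict(lambda: {"job_manager": [], "task_manager": []})
    for log in job_manager_logs:
        joined[log[key]]["job_manager"].append(log)
    for log in task_manager_logs:
        joined[log[key]]["task_manager"].append(log)
    return dict(joined)
```

Each entry of the result correlates the tasks that were executed with the job that scheduled them.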

The raw data converter 3402 can be configured to convert ingested data into a comparable data structure. Specifically, the raw data converter 3402 can parse an ingested piece of data (e.g., task manager logs, job manager logs, and/or other type(s) of application logs that describe various events) and identify delimiters (e.g., blank spaces, commas, periods, semicolons, dashes, pipes, and/or any other character that may separate two items, such as two tokens) in the ingested piece of data based on the parsing. A delimiter may separate two tokens (e.g., character strings denoting a field, a value, a function, an operation, etc.), and therefore the raw data converter 3402 can identify the token(s) (and the number thereof) in the ingested piece of data once the delimiters are identified (e.g., the number of tokens in the ingested piece of data may be the number of character strings separated by delimiters in the ingested piece of data). The raw data converter 3402 can then create a comparable data structure (e.g., a string vector) in which each element of the comparable data structure is an identified token in the ingested piece of data. The raw data converter 3402 may preserve the order in which the tokens appear in the ingested piece of data such that the first element in the comparable data structure is the first token that appears in the ingested piece of data, the second element in the comparable data structure is the second token that appears in the ingested piece of data, and so on.
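
As a non-limiting illustration, the conversion of an ingested piece of data into an ordered string vector can be sketched as follows. The delimiter set used here (blank spaces, commas, semicolons, and pipes) is an assumed subset of the delimiters listed above, and the function name is hypothetical.

```python
import re

# Assumed delimiter subset; a deployment could use a different set.
DELIMITERS = re.compile(r"[ ,;|]+")


def to_string_vector(raw):
    """Split a raw log line into tokens, preserving their original order."""
    return [token for token in DELIMITERS.split(raw.strip()) if token]
```

The length of the returned list is the number of tokens in the ingested piece of data.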

One or more of the pattern matchers 3404 can be configured to determine whether the created comparable data structure matches any existing data pattern or whether the created comparable data structure should be assigned a new data pattern. For example, if the volume of data being ingested is less than a threshold or the cardinality of the data being ingested (e.g., the number of users corresponding to ingested data, the number of devices corresponding to the ingested data, the number of different types of logs that comprise the ingested data, etc.) is less than a threshold, then the streaming data processor(s) 308 can spin up or launch a single pattern matcher 3404 to determine whether the created comparable data structure matches any existing data pattern or whether the created comparable data structure should be assigned a new data pattern. However, if the volume of data being ingested is greater than a threshold or the cardinality of the data being ingested is greater than a threshold, then the streaming data processor(s) 308 can spin up or launch multiple pattern matchers 3404 that collectively determine whether the created comparable data structure matches any existing data pattern or whether the created comparable data structure should be assigned a new data pattern, which is described in greater detail below with respect to FIG. 34B.

The pattern matcher(s) 3404 can store information for one or more data patterns, which may also be referred to herein as “templates.” A data pattern or template may include one or more alphanumeric strings and zero or more wildcards separated by delimiters. Each alphanumeric string may represent a token that is present in each comparable data structure assigned to the data pattern or template at the same position. A wildcard may indicate that the comparable data structure(s) assigned to the data pattern or template include two or more different values for the token corresponding to the position of the wildcard. As an illustrative example, a data pattern or template may be as follows: “<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected on rank 0, symbol <*> bit <*>.” In this example, “<*>” represents a wildcard, each word or number represents an alphanumeric string, and the blank spaces between the wildcards, words, and numbers represent delimiters. Thus, a comparable data structure assigned to this data pattern or template may include any value as a first token, “RAS” or “RAS KERNEL INFO” as a second token, any value as the next token, and so on. In some embodiments, a comparable data structure may not be assigned to this data pattern or template if the comparable data structure does not include “RAS” or “RAS KERNEL INFO” as its second token (unless the streaming data processor(s) 308 subsequently modifies the data pattern or template to replace “RAS” or “RAS KERNEL INFO” with a wildcard).

To determine whether the created comparable data structure matches any existing data pattern or whether the created comparable data structure should be assigned a new data pattern, the pattern matcher(s) 3404 can identify existing data patterns, if any, that correspond to comparable data structures that have the same number of tokens as the number of tokens identified by the raw data converter 3402 in the created comparable data structure. In other words, the pattern matcher(s) 3404 identifies existing data patterns, if any, to which string vectors are assigned that have a string vector length that is the same as the string vector length of the string vector created by the raw data converter 3402 for the ingested piece of data. The pattern matcher(s) 3404 then only compares the string vector created by the raw data converter 3402 with these existing data patterns. In this way, the pattern matcher(s) 3404 can reduce the number of comparisons that are made to assign the created comparable data structure to a data pattern, thereby reducing anomaly detection times and the amount of computing resources dedicated to detecting anomalies in ingested data.
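
As a non-limiting illustration, restricting the comparison to data patterns of the same length, and then testing a string vector against a template that may contain wildcards, can be sketched as follows. Templates are assumed to be stored as lists of tokens, and the function names are hypothetical.

```python
WILDCARD = "<*>"


def candidates_by_length(vector, templates):
    """Keep only the templates with the same number of tokens as the vector."""
    return [template for template in templates if len(template) == len(vector)]


def matches_template(vector, template):
    """True if every position holds an equal token or a wildcard in the template."""
    if len(vector) != len(template):
        return False
    return all(tok == WILDCARD or tok == elem for elem, tok in zip(vector, template))
```

Filtering by length first means the position-by-position comparison runs only against templates that could possibly describe the vector.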

Generally, a data pattern can be represented by a cluster having a centroid. Each token position of the data pattern can represent a dimension in an m-dimensional space. Thus, the location of a centroid of a cluster (e.g., the location of a center or centroid of a data pattern) in the m-dimensional space can be determined by the pattern matcher(s) 3404 based on the average token values of the comparable data structures assigned to the data pattern. For example, if a token value at a first token position is a number, the pattern matcher(s) 3404 can add all of the token values of the comparable data structures assigned to a data pattern that correspond to a first token position (e.g., a first dimension) and divide by the number of comparable data structures assigned to the data pattern to determine the first dimension value of the centroid of the data pattern. If a token value at a first token position is a string, the pattern matcher(s) 3404 can assign numerical values to each distinct string present in a comparable data structure assigned to the data pattern, add all of the assigned numerical values, and divide the sum by the number of comparable data structures assigned to the data pattern to determine the first dimension value of the centroid of the data pattern. The pattern matcher(s) 3404 can repeat these operations for each dimension to determine m dimension values that represent the centroid of the data pattern. As described above, data patterns can include a different number of tokens. Thus, the value of m may be different based on the number of tokens (e.g., the number of token positions) present in a data pattern.
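
As a non-limiting illustration, the centroid of a data pattern can be computed dimension by dimension as sketched below. The scheme for assigning numerical values to distinct strings (the order of first appearance within a token position) is an assumption made for the example, as is the function name; all vectors are assumed to have the same number of tokens.

```python
def centroid(vectors):
    """Average token values per position over same-length string vectors."""
    m = len(vectors[0])
    result = []
    for position in range(m):
        column = [vector[position] for vector in vectors]
        try:
            # Numeric tokens are averaged directly.
            values = [float(token) for token in column]
        except ValueError:
            # Assumed encoding: distinct strings are numbered 0, 1, 2, ...
            codes = {}
            values = [float(codes.setdefault(token, len(codes))) for token in column]
        result.append(sum(values) / len(values))
    return result
```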

A user or the system can set a k value that represents a number of clusters (e.g., data patterns) that should be created to which comparable data structures can be assigned. However, the comparable data structure assignment described herein can occur even if a k value is not set by a user or system. In an embodiment in which anomalies are detected in ingested pieces of data in real-time, the first time a comparable data structure is created (before any data patterns have been created by the pattern matcher(s) 3404), the pattern matcher(s) 3404 can assign the first comparable data structure to a new data pattern that matches the first comparable data structure. The second time a comparable data structure is created, the pattern matcher(s) 3404 can assign the second comparable data structure to a new data pattern as well that matches the second comparable data structure. This process can continue for each subsequent comparable data structure until k data patterns have been created.

At this point, the pattern matcher(s) 3404 can evaluate the next comparable data structure (e.g., the (k+1)th comparable data structure to arrive) to determine whether the next comparable data structure should be assigned to one of the k existing data patterns or whether the next data structure should be assigned to a new data pattern, and the pattern matcher(s) 3404 can then assign the next comparable data structure to the appropriate data pattern. For example, the pattern matcher(s) 3404 can maintain a facility cost, which is also referred to herein as a minimum cluster distance. As described above, each data pattern includes a certain number of tokens. The pattern matcher(s) 3404 may determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between each data pattern having the same number of tokens, and repeat this determination for each set of data patterns having the same number of tokens. Specifically, the pattern matcher(s) 3404 may determine a distance between the location of a center of a first data pattern and the location of a center of a second data pattern having the same number of tokens as the first data pattern. For each set of data patterns having the same number of tokens, the pattern matcher(s) 3404 can determine the smallest distance between data patterns and set this distance as the minimum cluster distance for the respective set of data patterns. Thus, the pattern matcher(s) 3404 may determine multiple minimum cluster distances, one for each set of data patterns having the same length (e.g., the same number of tokens or token positions). The pattern matcher(s) 3404 can then determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between the next comparable data structure and each existing data pattern having the same number of tokens as the next comparable data structure.

If the pattern matcher(s) 3404 determines that this distance is less than or equal to the minimum cluster distance corresponding to the set of data patterns having the same number of tokens as the next comparable data structure, this may indicate that the next comparable data structure is close enough to one of the existing data patterns to be assigned thereto. Thus, the pattern matcher(s) 3404 can assign the next comparable data structure to the data pattern closest (e.g., by distance) to the next comparable data structure. Alternatively, the pattern matcher(s) 3404 can compare the next comparable data structure to the existing data patterns having the same number of tokens to determine whether the next comparable data structure matches any of these existing data patterns. For example, the pattern matcher(s) 3404 can compare each element of the next comparable data structure with a token in an existing data pattern that has the same position as the respective element (e.g., the pattern matcher(s) 3404 can compare the first element with the first token in an existing data pattern, the second element with the second token in an existing data pattern, and so on), counting the number of times the element and corresponding token match. The pattern matcher(s) 3404 can then divide the number of times the element and corresponding token match for a given existing data pattern by a length of the next comparable data structure (e.g., by the number of tokens included therein) to produce a match percentage. The pattern matcher(s) 3404 can assign the next comparable data structure to the existing data pattern that produces the highest match percentage.

As part of the assignment, the pattern matcher(s) 3404 can increase a weight of the data pattern by 1 (or any like value) to reflect that 1 additional comparable data structure has been assigned to the data pattern (e.g., update a count of a number of comparable data structures assigned to the data pattern to reflect that a new comparable data structure has been assigned to the data pattern) and can adjust a centroid of the data pattern to account for the newly assigned comparable data structure. Specifically, the pattern matcher(s) 3404 can update the centroid of the data pattern by averaging the token values of the comparable data structures previously assigned to the data pattern and of the next comparable data structure to form updated m dimension values representing the centroid. Because the centroid of the data pattern has been updated, the pattern matcher(s) 3404 can also recalculate the minimum cluster distance for the data pattern(s) that have the same number of tokens as the data pattern to which the next comparable data structure is assigned, and the recalculated minimum cluster distance can be used by the pattern matcher(s) 3404 in future data pattern assignment operations.

However, if the pattern matcher(s) 3404 determines that this distance is greater than the minimum cluster distance corresponding to the set of data patterns having the same number of tokens as the next comparable data structure, this may indicate that the next comparable data structure is too far from any of the existing data patterns having the same number of tokens as the next comparable data structure. Thus, the pattern matcher(s) 3404 can assign the next comparable data structure to a new data pattern. Because creation of the new data pattern means that the number of data patterns having the same number of tokens as present in the new data pattern has increased, the pattern matcher(s) 3404 can calculate or recalculate the minimum cluster distance for the data pattern(s) that have the same number of tokens as the new data pattern to which the next comparable data structure is assigned, and the recalculated minimum cluster distance can be used by the pattern matcher(s) 3404 in future data pattern assignment operations.
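
As a non-limiting illustration, the assignment decision based on the minimum cluster distance, the centroid update, and the alternative match percentage can be sketched as follows. The sketch assumes a Euclidean distance, centroids and points already encoded as numeric vectors of the same dimension, and at least two existing data patterns of that dimension; the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_cluster_distance(centroids):
    """Facility cost: the smallest pairwise distance among the centroids."""
    return min(euclidean(a, b) for a, b in combinations(centroids, 2))


def nearest_pattern(point, centroids):
    """Index of the nearest centroid, or None if a new data pattern is warranted."""
    facility_cost = minimum_cluster_distance(centroids)
    distances = [euclidean(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best if distances[best] <= facility_cost else None


def update_centroid(centroid, weight, point):
    """Fold one newly assigned point into the running average; weight grows by 1."""
    new_weight = weight + 1
    updated = [c + (p - c) / new_weight for c, p in zip(centroid, point)]
    return updated, new_weight


def match_percentage(vector, template):
    """Fraction of positions at which the element equals the template token."""
    matches = sum(1 for elem, tok in zip(vector, template) if elem == tok)
    return matches / len(vector)
```

After either branch (an updated centroid or a newly created data pattern), the minimum cluster distance for that token count would be recomputed before the next assignment.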

If the pattern matcher(s) 3404 assigns a comparable data structure to an existing data pattern, the pattern matcher(s) 3404 can determine whether the existing data pattern properly describes the comparable data structure. In particular, the pattern matcher(s) 3404 can determine whether any elements of the comparable data structure do not match the corresponding tokens of the assigned data pattern (where an element of the comparable data structure is considered to match a token of the assigned data pattern if the value of the element is an alphanumeric string that matches the alphanumeric string of the token or if the token is a wildcard). If an element does not match a corresponding token, then the pattern matcher(s) 3404 can replace the token with a wildcard, thereby modifying the assigned data pattern to include a wildcard in place of the alphanumeric string that was previously present. As an illustrative example, if the comparable data structure has the value “1074” in the fourth element, but the fourth token of the assigned data pattern is “74,” then the pattern matcher(s) 3404 can modify the fourth token in the assigned data pattern to be “<*>” instead of “74.” When modifying the data pattern to include a wildcard in place of an alphanumeric string, the pattern matcher(s) 3404 can generate metadata associated with the data pattern identifying the specific alphanumeric values or a range of alphanumeric values represented by the wildcard. In other words, the pattern matcher(s) 3404 can generate metadata to track what alphanumeric values are represented by a wildcard.
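
As a non-limiting illustration, replacing mismatched tokens with wildcards while recording the values each wildcard represents can be sketched as follows. The representation of the metadata as a dictionary from token position to a set of observed values, and the function name, are assumptions made for the example.

```python
WILDCARD = "<*>"


def generalize(template, vector, metadata):
    """Return an updated template; record the values seen at wildcard positions."""
    updated = list(template)
    for position, (token, element) in enumerate(zip(template, vector)):
        if token == WILDCARD:
            # Position is already a wildcard: just track the new value.
            metadata.setdefault(position, set()).add(element)
        elif token != element:
            # Mismatch: replace the token and remember both values.
            metadata.setdefault(position, set()).update({token, element})
            updated[position] = WILDCARD
    return updated
```

Applied to the example above, a template whose fourth token is “74” and a string vector whose fourth element is “1074” yield a template with “<*>” in the fourth position and metadata recording both “74” and “1074” for that position.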

If the pattern matcher(s) 3404 assigns a comparable data structure to a new data pattern, the pattern matcher(s) 3404 can define the new data pattern as being the elements of the comparable data structure. As additional pieces of ingested data are obtained and processed, the pattern matcher(s) 3404 may modify this new data pattern to describe multiple comparable data structures (e.g., the pattern matcher(s) 3404 may replace some tokens that describe the data pattern with wildcards).

The pattern matcher(s) 3404 can continue these operations for subsequent comparable data structures while the number of data patterns is greater than k and until the number of data patterns equals a threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of comparable data structures that have been received up to that point) or until a threshold period of time has passed. Once the number of data patterns reaches the threshold or the threshold period of time has passed, the pattern matcher(s) 3404 can perform a merge operation to reduce the number of data patterns. For example, the pattern matcher(s) 3404 can use a clustering algorithm (e.g., k-means++), treating each data pattern as a separate point to cluster, to generate a new, smaller set of data patterns in which one or more of the existing data patterns have been merged together. For example, the clustering algorithm can take one or more passes (e.g., 1, 2, 3, etc.) on the existing data patterns to generate the new, smaller set of data patterns. Data patterns may be merged by the pattern matcher(s) 3404 hierarchically, meaning that two or more data patterns can be merged together to form a single, merged data pattern and one or more sets of data patterns can be separately merged together. The pattern matcher(s) 3404 can re-assign comparable data structures that were previously assigned to the data patterns that were merged to the merged data pattern. A merged data pattern may have a definition that appropriately describes each of the comparable data structures that were previously assigned to the data patterns that were merged to form the merged data pattern and that are now assigned to the merged data pattern.
As an illustrative example, if the data pattern “<*> RAS LINKCARD INFO MidplaneSwitchController performing bit sparing on <*> bit <*>” and the data pattern “<*> RAS LINKCARD INFO DownplaneSwitchController performing bit sparing on <*> bit <*>” are merged, the merged data pattern may be “<*> RAS LINKCARD INFO <*> performing bit sparing on <*> bit <*>” (e.g., where “MidplaneSwitchController” and “DownplaneSwitchController” are replaced with a wildcard). The pattern matcher(s) 3404 can then continue these operations for each subsequent comparable data structure that is created.
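The following sketch shows one way the definition of a merged data pattern could be formed once two data patterns have been selected for merging. It does not implement the clustering step (e.g., k-means++) that selects which patterns to merge; the function name `merge_patterns` and the whitespace tokenization are assumptions made for illustration.

```python
WILDCARD = "<*>"


def merge_patterns(first, second):
    """Merge two data patterns that have the same number of tokens.

    Positions where the two patterns agree keep their token; positions
    where they differ become a wildcard.
    """
    first_tokens = first.split()
    second_tokens = second.split()
    if len(first_tokens) != len(second_tokens):
        raise ValueError("patterns must have the same number of tokens")
    merged = [
        a if a == b else WILDCARD
        for a, b in zip(first_tokens, second_tokens)
    ]
    return " ".join(merged)


merged = merge_patterns(
    "<*> RAS LINKCARD INFO MidplaneSwitchController performing bit sparing on <*> bit <*>",
    "<*> RAS LINKCARD INFO DownplaneSwitchController performing bit sparing on <*> bit <*>",
)
print(merged)
```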

Because the number of data patterns may be reduced after a merge operation, the pattern matcher(s) 3404 can recalculate the minimum cluster distance for the data pattern(s) that have the same number of tokens as the data pattern(s) that were merged together, and the recalculated minimum cluster distance can be used by the pattern matcher(s) 3404 in future data pattern assignment operations. In some embodiments, a merge operation causes the minimum cluster distance to increase given that fewer data patterns remain. Because the pattern matcher(s) 3404 creates a new data pattern when the distance between a comparable data structure and the closest data pattern is greater than the minimum cluster distance, the increase in the minimum cluster distance from the merge operation may inherently cause the number of new data patterns being created to remain low. Thus, the number of data patterns may gravitate toward being k rather than the threshold, increasing accuracy and reducing computational costs.

Because the data to cluster is known when clustering occurs offline (e.g., not in real-time, but sometime after data has been ingested and stored, such as periodically in batches), a traditional clustering algorithm can run multiple passes on the data and produce exactly k (or fewer) clusters. When attempting to cluster data online or in real-time (e.g., when attempting to assign comparable data structures to data patterns online or in real-time), data previously received is known, but the data to be received in the future is unknown. To use a traditional clustering algorithm, the pattern matcher(s) 3404 would have to obtain the previously created comparable data structures and a comparable data structure that was just created, and apply the traditional clustering algorithm to these comparable data structures to obtain a new set of data patterns to which the comparable data structures are assigned. The pattern matcher(s) 3404 would then have to repeat these operations each time a new comparable data structure or a new set of comparable data structures is received. The pattern matcher(s) 3404 described herein are capable of assigning comparable data structures to data patterns in batches using a traditional clustering algorithm (e.g., k-means clustering) in a manner as described above. It may be too computationally costly, however, for the pattern matcher(s) 3404 to generate new data patterns and re-assign previously created comparable data structures to the new data patterns each time a new comparable data structure is received using a traditional clustering algorithm. As each new comparable data structure is received, the number of comparable data structures to assign to a data pattern would grow. Over time, the latency of the streaming data processor(s) 308 would increase, thereby incrementally increasing anomaly detection times.

The clustering algorithm described above as being implemented by the pattern matcher(s) 3404, however, can allow the pattern matcher(s) 3404 to accurately assign comparable data structures to data patterns online or in real-time without experiencing the incrementally higher delay or computational costs that would result from using a traditional clustering algorithm. The underlying theory that a clustering algorithm processing data online can be competitive, in terms of accuracy, with a traditional clustering algorithm is described in greater detail in Liberty et al., “An Algorithm for Online K-Means Clustering,” submitted on Feb. 23, 2015, which is hereby incorporated by reference herein in its entirety. To achieve this technical benefit, the pattern matcher(s) 3404 may not necessarily create exactly k clusters or data patterns. Rather, the pattern matcher(s) 3404 may maintain a number of data patterns greater than k and less than the threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of comparable data structures that have been received up to that point), with the number of data patterns generally being closer to k than to the threshold. The pattern matcher(s) 3404 may maintain this number of data patterns even after a merge operation occurs. Thus, the pattern matcher(s) 3404 can create data patterns, assign comparable data structures to data patterns, and merge data patterns in real-time without being negatively affected by the drawbacks associated with using a traditional clustering algorithm.

4.15.1.1. Pattern Matching Distributed Architecture

As described above, the streaming data processor(s) 308 can launch multiple pattern matchers 3404 if the volume of the ingested data exceeds a threshold and/or the cardinality of the ingested data exceeds a threshold. Typically, systems that process data in batches have a training phase and a scoring phase. In the training phase, a training system can perform multiple passes on stored, known data to generate a model for processing future data. In the scoring phase, a production system can use the model to process ingested data. If the production system fails, the failure does not result in a loss of the model because the model is static. In other words, the production system had not been updating the model based on the ingested data. Rather, the model used by the production system remained in the same state as when the model was generated by the training system. A new production system can be instantiated to replace the failed production system, and the model can simply be exported from the training system to the new production system, allowing data processing to continue without error. When processing data online or in real-time, however, the model is not static. Specifically, when processing data online or in real-time, the data is constantly being streamed to the data ingestion pipeline. As a result, the data ingestion pipeline is continuously processing the streamed data, learning from the data as the data is streamed and updating the model based on the learning. The model, therefore, is not static or a snapshot from a certain moment in time. A failure of a task in the data ingestion pipeline could thus result in a loss of the most-recent model, thereby reducing the accuracy of the data ingestion pipeline processing.
Launching multiple pattern matchers 3404, however, can alleviate these issues, allowing the data ingestion pipeline to constantly learn and be fault tolerant regardless of whether the volume of the ingested data exceeds a threshold and/or the cardinality of the ingested data exceeds a threshold. In fact, launching multiple pattern matchers 3404 in the architecture described herein can allow the data ingestion pipeline to pause and upgrade the data ingestion pipeline logic (e.g., incorporate new clustering algorithms (e.g., to improve cluster accuracy) and/or incorporate new steps in the data ingestion pipeline (e.g., to make the pipeline more efficient)) without causing the data ingestion pipeline to re-learn the model. Rather, the pattern matcher(s) 3404 can continue to use the most-recently learned model after the upgraded data ingestion pipeline logic is incorporated and the data ingestion pipeline resumes.

For example, the pattern matcher(s) 3404 can be separated into local pattern matchers 3404A-3404D and a global pattern matcher 3404N, as shown in FIG. 34B. In other words, the streaming data processor(s) 308 can launch multiple pattern matcher 3404 tasks, with some pattern matcher 3404 task(s) operating as local task(s) and other pattern matcher 3404 task(s) operating as global task(s). The clustering algorithm described herein can be written such that the clustering algorithm can be distributed to the local pattern matchers 3404A-3404D and/or the global pattern matcher 3404N such that each pattern matcher 3404A-3404D and 3404N can run the clustering algorithm. In addition, the clustering algorithm can be written such that execution of the clustering algorithm is fast (e.g., the number of requests per second that can be processed by the clustering algorithm is high), allowing a larger volume of data to be processed. While FIG. 34B depicts four local pattern matchers 3404A-3404D and one global pattern matcher 3404N, this is not meant to be limiting. Any number of local pattern matchers 3404 and/or global pattern matchers 3404 may be launched by the streaming data processor(s) 308.

The streaming data processor(s) 308 can launch one or more sets of pattern matchers 3404A-3404D and 3404N, with each set processing ingested data for a user, a set of users, a device, a set of devices, a certain set of data, and/or the like. Each local pattern matcher 3404A-3404D can perform the same operations as described above with respect to the pattern matcher(s) 3404. Specifically, a local pattern matcher 3404A-3404D can assign a comparable data structure to an existing data pattern or a new data pattern and periodically merge data patterns in a manner as described above.

The local pattern matchers 3404A-3404D, however, may each receive a different set of data. For example, the volume or cardinality of data may be large such that having one pattern matcher 3404A-3404D process all of the data may be too overwhelming for the single pattern matcher 3404A-3404D to handle in a timely manner. Thus, the stream of ingested data can be broken up into chunks and each local pattern matcher 3404A-3404D can process a portion of the stream (e.g., one or more chunks) rather than the entire stream. Specifically, each local pattern matcher 3404A-3404D can process a certain portion of the comparable data structures. Accordingly, as illustrated in FIG. 34B, the local pattern matcher 3404A receives ingested data 1 (e.g., a first set of comparable data structures), the local pattern matcher 3404B receives ingested data 2 (e.g., a second set of comparable data structures), the local pattern matcher 3404C receives ingested data 3 (e.g., a third set of comparable data structures), and the local pattern matcher 3404D receives ingested data 4 (e.g., a fourth set of comparable data structures) as the data is ingested in real-time. In some embodiments, not shown, the streaming data processor(s) 308 can launch multiple raw data converters 3402 that may or may not have a 1-to-1 mapping to the local pattern matchers 3404A-3404D to facilitate the conversion of the ingested data into the comparable data structures.

Because the local pattern matchers 3404A-3404D each receive a different set of data, the data patterns created by each local pattern matcher 3404A-3404D may be different. In fact, the number of data patterns created by each local pattern matcher 3404A-3404D at any given time may be different given that the merge operations periodically performed by the local pattern matchers 3404A-3404D may result in different levels of data pattern consolidation. As a result, the local pattern matcher 3404A may create a first data pattern set, the local pattern matcher 3404B may create a second data pattern set, the local pattern matcher 3404C may create a third data pattern set, and the local pattern matcher 3404D may create a fourth data pattern set.

As described above, each local pattern matcher 3404A-3404D does not process each ingested piece of data. Rather, each local pattern matcher 3404A-3404D processes a portion thereof. Thus, periodically, when a certain volume of data has been processed, or when the number of data patterns created by any or all of the local pattern matchers 3404A-3404D reaches a threshold (e.g., a threshold on the order of k log₁₀ n), the global pattern matcher 3404N can merge the data patterns created by the individual local pattern matchers 3404A-3404D to create a merged data pattern set that is based on all of the ingested data to that point. For example, the global pattern matcher 3404N can use a clustering algorithm (e.g., k-means++) to merge the first, second, third, and fourth data pattern sets, treating each data pattern in the sets as a point to cluster, in a manner as described above to create the merged data pattern set. The merged data pattern set may incorporate characteristics learned from all of the data ingested to that point rather than just a subset of the data ingested to that point and processed by an individual local pattern matcher 3404A-3404D, as is true with the first, second, third, and fourth data pattern sets. The global pattern matcher 3404N can then feed the merged data pattern set back to the individual local pattern matchers 3404A-3404D so that the individual local pattern matchers 3404A-3404D can continue to process ingested data (e.g., assign comparable data structures to data patterns and/or merge data patterns) using the merged data pattern set rather than the data pattern set originally created by the individual local pattern matcher 3404A-3404D. As the local pattern matchers 3404A-3404D process newly ingested data (e.g., assign comparable data structures to data patterns and/or merge data patterns) using the merged data pattern set, each local pattern matcher 3404A-3404D may modify the merged data pattern set in different ways.
However, the global pattern matcher 3404N can subsequently merge these modified data pattern sets and provide this most-recently merged data pattern set to the local pattern matcher(s) 3404A-3404D for use in processing data ingested in the future (e.g., for use in assigning comparable data structures to data patterns and/or merging data patterns), and the cycle can continue. Thus, the architecture described herein includes nested merge operations, where the local pattern matchers 3404A-3404D may each regularly perform merge operations on their own data pattern sets in a manner as described herein, and then the global pattern matcher 3404N can perform a merge operation on the data pattern sets created by the local pattern matchers 3404A-3404D periodically, when a certain volume of data has been processed, or when the number of data patterns created by any or all of the local pattern matchers 3404A-3404D reaches a threshold. Alternatively, one or more of the local pattern matchers 3404A-3404D can merge the data pattern sets created by the local pattern matchers 3404A-3404D rather than the global pattern matcher 3404N (thereby resulting in the streaming data processor(s) 308 declining to launch the global pattern matcher 3404N).

Thus, the feedback architecture described herein ensures that the pattern matcher(s) 3404A-3404D and 3404N are constantly learning and producing updated or merged data pattern sets. In fact, use of the local pattern matcher(s) 3404A-3404D further increases fault tolerance and allows for the data ingestion pipeline logic to be upgraded without disruption to the data ingestion pipeline itself. For example, each algorithm implemented by and/or each model (e.g., data pattern set) created by the local pattern matcher(s) 3404A-3404D and/or the global pattern matcher 3404N can be converted into, mapped to, and/or backed up by a FLINK operator (e.g., a stateful FLINK operator). Converting, mapping, or backing up the algorithms into FLINK operators can allow the algorithms to run on local tasks (e.g., the local pattern matchers 3404A-3404D). The FLINK operator (e.g., the stateful FLINK operator) may periodically store its state in a keyed state store. If a local pattern matcher 3404A-3404D fails, the streaming data processor(s) 308 can simply launch a new local pattern matcher 3404A-3404D to replace the failed local pattern matcher 3404A-3404D and retrieve the FLINK operator corresponding to the failed local pattern matcher 3404A-3404D from the keyed state store such that the algorithm and/or model (e.g., data pattern set) represented by the FLINK operator can be applied to the new local pattern matcher 3404A-3404D. In other words, the streaming data processor(s) 308 can recreate the failed local pattern matcher 3404A-3404D using the FLINK operator stored in the keyed state store. Applying the algorithm and/or model represented by the FLINK operator to the new local pattern matcher 3404A-3404D allows the new local pattern matcher 3404A-3404D to operate using the backed up algorithm and/or model (e.g., data pattern set), thereby allowing the data ingestion pipeline to continue operations without losing the state of the failed local pattern matcher 3404A-3404D.
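The checkpoint-and-recover behavior described above can be illustrated with a small, framework-independent sketch. It deliberately does not use any FLINK API; the classes `KeyedStateStore` and `LocalPatternMatcher` and their methods are hypothetical stand-ins that only model the idea of periodically saving a data pattern set under a key and restoring it into a replacement task.

```python
import copy


class KeyedStateStore:
    """Hypothetical in-memory stand-in for a keyed state store."""

    def __init__(self):
        self._snapshots = {}

    def save(self, key, state):
        # Store a deep copy so later mutations do not alter the snapshot.
        self._snapshots[key] = copy.deepcopy(state)

    def restore(self, key):
        return copy.deepcopy(self._snapshots[key])


class LocalPatternMatcher:
    """Hypothetical local task whose model is a list of data patterns."""

    def __init__(self, task_id, patterns=None):
        self.task_id = task_id
        self.patterns = patterns if patterns is not None else []

    def checkpoint(self, store):
        store.save(self.task_id, self.patterns)

    @classmethod
    def recover(cls, task_id, store):
        # Recreate a failed task from its most recent snapshot.
        return cls(task_id, store.restore(task_id))


store = KeyedStateStore()
matcher = LocalPatternMatcher("3404A", ["<*> RAS LINKCARD INFO <*>"])
matcher.checkpoint(store)

# Simulate a failure after an un-checkpointed change, then recover.
matcher.patterns.append("pattern learned after the last checkpoint")
replacement = LocalPatternMatcher.recover("3404A", store)
```

In this sketch, anything learned after the last checkpoint is lost on recovery, which is why the snapshot interval is a trade-off between overhead and the amount of learning at risk.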

As another example, the FLINK operator may have a migration policy that the streaming data processor(s) 308 can use to determine whether upgraded data ingestion pipeline logic (e.g., to replace or upgrade the algorithm) is compatible with the models (e.g., data patterns) created by the local pattern matcher(s) 3404A-3404D (e.g., to determine whether upgraded data ingestion pipeline logic can read the models). If the streaming data processor(s) 308 determine that the upgraded data ingestion pipeline logic is compatible with the models (e.g., data patterns), the streaming data processor(s) 308 can pause and/or refresh the data ingestion pipeline to incorporate the upgraded data ingestion pipeline logic (which can include a new FLINK operator representing a new algorithm, a new pipeline step, etc.). The streaming data processor(s) 308 can then resume the data ingestion pipeline from the previous state, using the previously learned models (e.g., the most recent set of data patterns) and the upgraded data ingestion pipeline logic (e.g., the new or upgraded clustering algorithm) to process ingested data (e.g., comparable data structures). Thus, the models do not need to be re-learned when the data ingestion pipeline logic is upgraded.

The raw data converter 3402 and the pattern matcher(s) 3404 can perform the operations described herein as each new ingested piece of data is obtained (and prior to such ingested data being indexed and stored). Thus, the pattern matcher(s) 3404 can assign a representation of each new ingested piece of data (e.g., a comparable data structure created from the ingested piece of data) to a data pattern in sequence as the respective ingested data piece is obtained, thereby performing a streaming, online data pattern assignment operation.

4.15.1.2. Anomaly Detection in Logs

The anomaly detector 3406 can be configured to detect potential anomalies in the ingested data as the data is ingested or periodically in batches, such as every minute, every hour, every day, etc. In other words, the anomaly detector 3406 can be configured to detect anomalous events in the joined logs as the logs are ingested or periodically in batches. Specifically, the anomaly detector 3406 can detect anomalies in token values and/or anomalous data patterns. If an ingested piece of data (e.g., job manager logs, task manager logs, and/or other type(s) of application logs describing the occurrence of various events) has an anomalous token value or corresponds to an anomalous data pattern, then the ingested piece of data may be considered to describe an anomalous event. For example, to detect potential token value anomalies in the ingested data as the data is ingested, the anomaly detector 3406 can identify the data pattern assigned to a comparable data structure created for a current ingested piece of data being processed and identify token values represented by the wildcard(s) of the data pattern (e.g., by retrieving metadata including such information from the pattern matcher(s) 3404). If the values for a particular token are numbers, the anomaly detector 3406 can determine percentiles of the range of values for that token (e.g., 25th percentile, 50th percentile, 75th percentile, etc.), the mode of the values for that token, the median of the values for that token, the mean of the values for that token, and/or other like statistics. If the values for a particular token are letter(s) or word(s), the anomaly detector 3406 can count the number of times a letter or word appears as a value for the token and determine the percentiles or other statistics as described above. The anomaly detector 3406 can then use the percentiles to determine whether the value of a token present in the current ingested piece of data is anomalous.
As an illustrative example, if the value of a token present in the current ingested piece of data falls below the 25th percentile (e.g., the value is too low, if a number, or appears a small number of times, if a letter or word) and/or falls above the 75th percentile (e.g., the value is too high, if a number, or appears a large number of times, if a letter or word), then the anomaly detector 3406 may flag this ingested piece of data and the token value as being anomalous.
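A minimal sketch of the percentile check described above follows, using the 25th and 75th percentiles from the illustrative example. The function names are hypothetical, the quartiles are computed with the standard-library `statistics.quantiles` function (which needs at least two observations on older Python versions), and the choice of quartile interpolation method is an assumption.

```python
import statistics
from collections import Counter


def quartile_bounds(values):
    """Return the (25th percentile, 75th percentile) of `values`."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3


def is_numeric_anomaly(value, history):
    """Flag a numeric token value outside the 25th-75th percentile band."""
    q1, q3 = quartile_bounds(history)
    return value < q1 or value > q3


def is_categorical_anomaly(value, history):
    """Flag a word whose occurrence count is unusually low or high.

    Occurrence counts per distinct word are treated as the numeric series
    whose percentiles are evaluated.
    """
    counts = Counter(history)
    q1, q3 = quartile_bounds(list(counts.values()))
    occurrences = counts[value]
    return occurrences < q1 or occurrences > q3


latencies = [10, 12, 11, 13, 12, 11, 10, 12]
levels = ["INFO"] * 50 + ["WARN"] * 10 + ["DEBUG"] * 10 + ["ERROR"] * 9 + ["FATAL"]

print(is_numeric_anomaly(1074, latencies))
print(is_categorical_anomaly("FATAL", levels))
```

A band this narrow flags a large share of ordinary values, so a deployment would likely choose more extreme percentiles; the 25th and 75th are used here only because they appear in the example.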

To detect potential anomalous data patterns in the ingested data as the data is ingested, the anomaly detector 3406 can identify the data pattern assigned to a comparable data structure created for a current ingested piece of data being processed. If no other comparable data structures have been assigned to this data pattern, the anomaly detector 3406 can flag this ingested piece of data as being anomalous.

To detect potential token value anomalies in the ingested data periodically in batches, the anomaly detector 3406 can iterate through some or all of the data patterns created during this period and identify token values represented by the wildcard(s) of the respective data pattern (e.g., by retrieving metadata including such information from the pattern matcher 3404). If the values for a particular token are numbers, the anomaly detector 3406 can determine percentiles of the range of values for that token (e.g., 25th percentile, 50th percentile, 75th percentile, etc.), the mode of the values for that token, the median of the values for that token, the mean of the values for that token, and/or the like. If the values for a particular token are letter(s) or word(s), the anomaly detector 3406 can count the number of times a letter or word appears as a value for the token and determine the percentiles or other statistics as described above. The anomaly detector 3406 can then use the percentiles to determine whether the value of a token present in any of the pieces of ingested data assigned to the respective data pattern is anomalous. As an illustrative example, if the value of a token present in an ingested piece of data falls below the 25th percentile (e.g., the value is too low, if a number, or appears a small number of times, if a letter or word) and/or falls above the 75th percentile (e.g., the value is too high, if a number, or appears a large number of times, if a letter or word), then the anomaly detector 3406 may flag this ingested piece of data and the token value as being anomalous.

To detect potential anomalous data patterns in the ingested data periodically in batches, the anomaly detector 3406 can iterate through some or all of the data patterns created during the period. If a data pattern has a small number of comparable data structures assigned thereto (e.g., 1, 2, 3, etc.), the anomaly detector 3406 can flag the piece(s) of ingested data assigned to the data pattern as being anomalous.
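The batch check for sparsely populated data patterns can be sketched as a simple count over pattern assignments. The function name `rare_patterns`, the use of pattern identifiers as strings, and the default cutoff of 3 are assumptions chosen to match the "1, 2, 3, etc." example.

```python
from collections import Counter


def rare_patterns(assignments, max_count=3):
    """Return the patterns with at most `max_count` assigned data structures.

    `assignments` holds one pattern identifier per comparable data
    structure processed during the batch period.
    """
    counts = Counter(assignments)
    return {pattern for pattern, count in counts.items() if count <= max_count}


batch = ["login ok <*>"] * 40 + ["disk <*> full"] * 2 + ["kernel panic <*>"]
flagged = rare_patterns(batch)
```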

In further embodiments, the anomaly detector 3406 can also detect anomalies in sequences of logs. For example, individual logs may not include anomalous token values or be assigned to an anomalous data pattern. However, the sequence in which the logs are generated may be anomalous. Thus, pattern matcher(s) 3404 can use the techniques described herein to create log sequence clusters, assign sequences of logs to the log sequence clusters, and merge log sequence clusters when any of the conditions described herein are met. The anomaly detector 3406 can then analyze the assigned log sequences, identifying those log sequences assigned to a log sequence cluster that have an occurrence among all of the log sequences assigned to the log sequence cluster less than a threshold or percentile or greater than a threshold or percentile as being anomalous or those log sequences assigned to a log sequence cluster having a small number (e.g., 1, 2, 3, etc.) of assigned log sequences as being anomalous.

The anomalies detected by the anomaly detector 3406 may be surfaced via one or more user interfaces that can be displayed by a client device 204. For example, the anomaly detector 3406 or another component in the data intake and query system 108 can generate user interface data based on the anomalies detected by the anomaly detector 3406 such that the user interface data, when rendered by a client device 204, causes the client device 204 to display one or more user interfaces depicting the anomaly information. Examples of such user interfaces are described below with respect to FIGS. 35-40.

4.15.1.3. Outlier Detection Distributed Architecture

One or more of the pipeline metric outlier detectors 3408 can be configured to perform a multi-variate time-series outlier detection on ingested pipeline metrics. For example, if the volume of data being ingested is less than a threshold or the cardinality of the data being ingested (e.g., the number of users corresponding to ingested data, the number of devices corresponding to the ingested data, the number of different types of pipeline metrics that comprise the ingested data, etc.) is less than a threshold, then the streaming data processor(s) 308 can spin up or launch a single pipeline metric outlier detector 3408 to perform the multi-variate time-series outlier detection. However, if the volume of data being ingested is greater than a threshold or the cardinality of the data being ingested is greater than a threshold, then the streaming data processor(s) 308 can spin up or launch multiple pipeline metric outlier detectors 3408 that collectively perform a multi-variate time-series outlier detection, which is described in greater detail below with respect to FIG. 34C.

The pipeline metric outlier detector(s) 3408 can receive one or more pipeline metrics that correspond to various time instants. The pipeline metric outlier detector(s) 3408 can group different pipeline metrics that correspond to the same time instant, and assign the grouped pipeline metrics to a metric cluster. Thus, a metric cluster may be assigned a first set of different pipeline metrics corresponding to a first time, a second set of different pipeline metrics corresponding to a second time, and so on.
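Grouping pipeline metrics by time instant can be sketched as below. The `(timestamp, metric_name, value)` tuple layout, the metric names, and the function name `group_by_time` are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_time(samples):
    """Group (timestamp, metric_name, value) samples by timestamp.

    Returns a dict mapping each timestamp to a dict of metric name -> value,
    i.e., one group of different pipeline metrics per time instant.
    """
    groups = defaultdict(dict)
    for timestamp, metric_name, value in samples:
        groups[timestamp][metric_name] = value
    return dict(groups)


samples = [
    (1000, "cpu", 0.42),
    (1000, "latency_ms", 12.0),
    (1001, "cpu", 0.45),
    (1001, "latency_ms", 11.5),
]
grouped = group_by_time(samples)
```

Each resulting group can then be treated as one point whose dimensions are the individual metric types.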

A metric cluster can be a cluster having a centroid. If the pipeline metric outlier detector(s) 3408 groups m pipeline metrics for assignment to a metric cluster, then the location of a center or centroid of a metric cluster may be in an m-dimensional space. Each dimension value in the centroid, therefore, may be an average value of one of m different pipeline metrics assigned to the metric cluster. For example, the pipeline metric outlier detector(s) 3408 can add all of the values of a first type of metric corresponding to various time instants that are assigned to a metric cluster and divide by the number of values of the first type of metric that are assigned to the metric cluster to determine a dimension value of the centroid of the metric cluster corresponding to the first type of metric. The pipeline metric outlier detector(s) 3408 can repeat this operation for each type of metric assigned to the metric cluster.
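The per-dimension averaging described above can be written compactly. In this sketch each group of pipeline metrics is assumed to be a list of m numbers in a fixed metric order, and the function name `centroid` is hypothetical.

```python
def centroid(groups):
    """Return the m-dimensional centroid of the assigned metric groups.

    Each dimension of the centroid is the mean of one metric type across
    all groups assigned to the metric cluster.
    """
    if not groups:
        raise ValueError("a metric cluster needs at least one assigned group")
    dimensions = len(groups[0])
    return [
        sum(group[d] for group in groups) / len(groups)
        for d in range(dimensions)
    ]


# Two groups of m = 2 metrics (e.g., one per time instant).
center = centroid([[1.0, 10.0], [3.0, 30.0]])
```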

The pipeline metric outlier detector(s) 3408 can store information for one or more metric clusters. For example, the information can include data indicating the location of a centroid of the metric cluster(s), data indicating pipeline metrics and a timestamp of the pipeline metrics that are assigned to a metric cluster, etc.

A user or the system can set a k value that represents a number of clusters (e.g., metric clusters) that should be created to which grouped pipeline metrics can be assigned. However, the grouped pipeline metrics assignment described herein can occur even if a k value is not set by a user or system. In an embodiment in which anomalies are detected in ingested pieces of data (e.g., in pipeline metrics) in real-time, the first time a group of pipeline metrics corresponding to the same time instant are obtained (before any metric clusters have been created by the pipeline metric outlier detector(s) 3408), the pipeline metric outlier detector(s) 3408 can assign the first group of pipeline metrics to a new metric cluster. Thus, the centroid of the new metric cluster may match the values of the first group of pipeline metrics. The second time a group of pipeline metrics corresponding to the same time instant are obtained, the pipeline metric outlier detector(s) 3408 can assign the second group of pipeline metrics to a new metric cluster as well, where the centroid of the new metric cluster may match the values of the second group of pipeline metrics. This process can continue for each subsequent group of pipeline metrics corresponding to the same time instant until k metric clusters have been created.

At this point, the pipeline metric outlier detector(s) 3408 can evaluate the next group of pipeline metrics corresponding to the same time instant (e.g., the k+1 group of pipeline metrics corresponding to the same time instant) to determine whether the next group of pipeline metrics corresponding to the same time instant should be assigned to one of the k existing metric clusters or whether the next group of pipeline metrics corresponding to the same time instant should be assigned to a new metric cluster, and the pipeline metric outlier detector(s) 3408 can then assign the next group of pipeline metrics corresponding to the same time instant to the appropriate metric cluster. For example, the pipeline metric outlier detector(s) 3408 can maintain a facility cost, which is also referred to herein as a minimum cluster distance. The pipeline metric outlier detector(s) 3408 may determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between each metric cluster. Specifically, the pipeline metric outlier detector(s) 3408 may determine a distance between the location of a center of a first metric cluster and the location of a center of a second metric cluster. The pipeline metric outlier detector(s) 3408 can determine the smallest distance between metric clusters and set this distance as the minimum cluster distance. The pipeline metric outlier detector(s) 3408 can then determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between the next group of pipeline metrics corresponding to the same time instant and each existing metric cluster. If the pipeline metric outlier detector(s) 3408 determines that this distance is less than or equal to the minimum cluster distance, this may indicate that the next group of pipeline metrics corresponding to the same time instant is close enough to one of the existing metric clusters to be assigned thereto.
Thus, the pipeline metric outlier detector(s) 3408 can assign the next group of pipeline metrics corresponding to the same time instant to the metric cluster closest (e.g., by distance) to the next group of pipeline metrics corresponding to the same time instant. As part of the assignment, the pipeline metric outlier detector(s) 3408 can increase a weight of the metric cluster by 1 (or any like value) to reflect that 1 additional group of pipeline metrics corresponding to the same time instant has been assigned to the metric cluster (e.g., update a count of a number of groups of pipeline metrics corresponding to the same time instant assigned to the metric cluster to reflect that a new group of pipeline metrics corresponding to the same time instant has been assigned to the metric cluster) and can adjust a centroid of the metric cluster to account for the newly assigned group of pipeline metrics corresponding to the same time instant. Specifically, the pipeline metric outlier detector(s) 3408 can update the centroid of the metric cluster by averaging the metric values of the group(s) of pipeline metrics corresponding to the same time instant previously assigned to the metric cluster and of the next group of pipeline metrics corresponding to the same time instant to form updated m-dimensional values representing the centroid. Because the centroid of the metric cluster has been updated, the pipeline metric outlier detector(s) 3408 can also recalculate the minimum cluster distance for the metric clusters, and the recalculated minimum cluster distance can be used by the pipeline metric outlier detector(s) 3408 in future metric cluster assignment operations.

However, if the pipeline metric outlier detector(s) 3408 determines that the distance to the closest existing metric cluster is greater than the minimum cluster distance, this may indicate that the next group of pipeline metrics corresponding to the same time instant is too far from any of the existing metric clusters. Thus, the pipeline metric outlier detector(s) 3408 can assign the next group of pipeline metrics corresponding to the same time instant to a new metric cluster. Because creation of the new metric cluster means that the number of metric clusters has increased, the pipeline metric outlier detector(s) 3408 can calculate or recalculate the minimum cluster distance for the metric clusters, and the recalculated minimum cluster distance can be used by the pipeline metric outlier detector(s) 3408 in future metric cluster assignment operations.
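The assignment logic described above can be sketched as follows. This is a minimal illustration only, assuming Euclidean distance and a running-mean centroid update; the names MetricCluster, OnlineClusterer, and assign are hypothetical and do not come from the described system.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional


@dataclass
class MetricCluster:
    centroid: List[float]
    weight: int = 1  # number of metric groups assigned to this cluster


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OnlineClusterer:
    """Hypothetical sketch of the online assignment step."""

    def __init__(self, k: int):
        self.k = k
        self.clusters: List[MetricCluster] = []
        self.min_cluster_distance: Optional[float] = None

    def _recompute_min_distance(self):
        # Smallest centroid-to-centroid distance (the "facility cost").
        if len(self.clusters) < 2:
            self.min_cluster_distance = None
            return
        self.min_cluster_distance = min(
            euclidean(a.centroid, b.centroid)
            for a, b in combinations(self.clusters, 2)
        )

    def assign(self, group: List[float]) -> int:
        """Assign one group of metrics; return the index of its cluster."""
        # The first k groups each seed their own cluster.
        if len(self.clusters) < self.k:
            self.clusters.append(MetricCluster(list(group)))
            self._recompute_min_distance()
            return len(self.clusters) - 1

        # Find the closest existing cluster.
        idx, dist = min(
            ((i, euclidean(group, c.centroid))
             for i, c in enumerate(self.clusters)),
            key=lambda pair: pair[1],
        )

        if (self.min_cluster_distance is not None
                and dist > self.min_cluster_distance):
            # Too far from every existing cluster: open a new one.
            self.clusters.append(MetricCluster(list(group)))
            idx = len(self.clusters) - 1
        else:
            # Close enough: fold the group into the running mean.
            c = self.clusters[idx]
            c.centroid = [
                (old * c.weight + new) / (c.weight + 1)
                for old, new in zip(c.centroid, group)
            ]
            c.weight += 1

        self._recompute_min_distance()
        return idx
```

In this sketch the minimum cluster distance is recomputed over all pairs after every assignment, which is simple but quadratic in the number of clusters; an actual implementation may update it incrementally.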

In some embodiments, the pipeline metric outlier detector(s) 3408 can assign an outlier score to each group of pipeline metrics corresponding to the same time instant. For example, the pipeline metric outlier detector(s) 3408 can determine a distance between a group of pipeline metrics corresponding to the same time instant and a centroid of a metric cluster to which the group of pipeline metrics is assigned, and set this distance to be the outlier score.
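As a small illustration of this scoring step, the following sketch computes the outlier score as a Euclidean distance. The function names and the example threshold are assumptions made for illustration; other distance measures could be substituted.

```python
import math


def outlier_score(group, centroid):
    """Distance between a metric group and its assigned cluster centroid."""
    return math.dist(group, centroid)


def is_flagged(group, centroid, threshold):
    """Flag the group when its score exceeds a (hypothetical) threshold."""
    return outlier_score(group, centroid) > threshold
```

For instance, a group at (3, 4) assigned to a cluster centered at the origin would receive a score of 5.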

The pipeline metric outlier detector(s) 3408 can continue these operations for subsequent groups of pipeline metrics corresponding to the same time instant while the number of metric clusters is greater than k and until the number of metric clusters equals a threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of groups of pipeline metrics corresponding to the same time instant that have been received up to that point) or until a threshold period of time has passed. Once the number of metric clusters reaches the threshold or the threshold period of time has passed, the pipeline metric outlier detector(s) 3408 can perform a merge operation to reduce the number of metric clusters. For example, the pipeline metric outlier detector(s) 3408 can use a clustering algorithm (e.g., k-means++), treating each metric cluster as a separate point to cluster, to generate a new, smaller set of metric clusters in which one or more of the existing metric clusters have been merged together. For example, the clustering algorithm can take one or more passes (e.g., 1, 2, 3, etc.) on the existing metric clusters to generate the new, smaller set of metric clusters. Metric clusters may be merged by the pipeline metric outlier detector(s) 3408 hierarchically, meaning that two or more metric clusters can be merged together to form a single, merged metric cluster and one or more sets of metric clusters can be separately merged together. The pipeline metric outlier detector(s) 3408 can re-assign groups of pipeline metrics corresponding to the same time instant that were previously assigned to the metric clusters that were merged to the merged metric cluster. The pipeline metric outlier detector(s) 3408 can then continue these operations for each subsequent group of pipeline metrics corresponding to the same time instant that is obtained.

Because the number of metric clusters may be reduced after a merge operation, the pipeline metric outlier detector(s) 3408 can recalculate the minimum cluster distance, and the recalculated minimum cluster distance can be used by the pipeline metric outlier detector(s) 3408 in future metric cluster assignment operations. In some embodiments, a merge operation causes the minimum cluster distance to increase given that fewer metric clusters remain. Because the pipeline metric outlier detector(s) 3408 creates a new metric cluster when the distance between a group of pipeline metrics corresponding to the same time instant and the closest metric cluster is greater than the minimum cluster distance, the increase in the minimum cluster distance from the merge operation may inherently cause the number of new metric clusters being created to remain low. Thus, the number of metric clusters may gravitate toward being k rather than the threshold, increasing accuracy and reducing computational costs.
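The merge operation can be illustrated with the sketch below. Note that it swaps in a greedy closest-pair merge in place of a k-means++ pass, purely to keep the example short and deterministic; the function names, the (centroid, weight) tuple layout, and the exact form of the threshold are assumptions.

```python
import math
from itertools import combinations


def merge_threshold(k, n):
    """Cluster-count limit on the order of k * log10(n) (assumed form)."""
    return max(k, math.ceil(k * math.log10(max(n, 10))))


def merge_clusters(clusters, k):
    """Greedily merge (centroid, weight) pairs until at most k remain.

    Each existing cluster is treated as a single weighted point. The two
    closest centroids are merged into their weighted mean on each step.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    merged = [(list(c), w) for c, w in clusters]
    while len(merged) > k:
        # Pick the closest pair of centroids (i < j by construction).
        i, j = min(
            combinations(range(len(merged)), 2),
            key=lambda p: math.dist(merged[p[0]][0], merged[p[1]][0]),
        )
        (ci, wi), (cj, wj) = merged[i], merged[j]
        total = wi + wj
        centroid = [(a * wi + b * wj) / total for a, b in zip(ci, cj)]
        del merged[j]  # j > i, so index i is unaffected
        merged[i] = (centroid, total)
    return merged
```

Because the merged centroid is a weighted mean, a cluster that has absorbed many metric groups pulls the result toward itself more strongly than a lightly populated one.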

Because the data to cluster is known when clustering occurs offline (e.g., not in real-time, but sometime after data has been ingested and stored, such as periodically in batches), a traditional clustering algorithm can run multiple passes on the data and produce exactly k (or fewer) clusters. When attempting to cluster data online or in real-time (e.g., when attempting to assign groups of pipeline metrics corresponding to the same time instant to metric clusters online or in real-time), data previously received is known, but the data to be received in the future is unknown. To use a traditional clustering algorithm, the pipeline metric outlier detector(s) 3408 would have to obtain the previously created groups of pipeline metrics corresponding to the same time instant and a group of pipeline metrics corresponding to the same time instant that was just obtained, and apply the traditional clustering algorithm to these groups of pipeline metrics corresponding to the same time instant to obtain a new set of metric clusters to which the groups of pipeline metrics corresponding to the same time instant are assigned. The pipeline metric outlier detector(s) 3408 would then have to repeat these operations each time a new group of pipeline metrics corresponding to the same time instant or a new set of groups of pipeline metrics corresponding to the same time instant is received. The pipeline metric outlier detector(s) 3408 described herein are capable of assigning groups of pipeline metrics corresponding to the same time instant to metric clusters in batches using a traditional clustering algorithm (e.g., k-means clustering) in a manner as described above.
It may be too computationally costly, however, for the pipeline metric outlier detector(s) 3408 to generate new metric clusters and re-assign previously obtained groups of pipeline metrics corresponding to the same time instant to the new metric clusters each time a new group of pipeline metrics corresponding to the same time instant is received using a traditional clustering algorithm. As each new group of pipeline metrics corresponding to the same time instant is received, the number of groups of pipeline metrics corresponding to the same time instant to assign to a metric cluster would grow. Over time, the latency of the streaming data processor(s) 308 would increase, thereby incrementally increasing anomaly detection times.

The clustering algorithm described above as being implemented by the pipeline metric outlier detector(s) 3408, however, can allow the pipeline metric outlier detector(s) 3408 to accurately assign groups of pipeline metrics corresponding to the same time instant to metric clusters online or in real-time without experiencing the incrementally higher delay or computational costs that would result from using a traditional clustering algorithm. To achieve this technical benefit, the pipeline metric outlier detector(s) 3408 may not necessarily create exactly k clusters or metric clusters. Rather, the pipeline metric outlier detector(s) 3408 may maintain a number of metric clusters greater than k and less than the threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of groups of pipeline metrics corresponding to the same time instant that have been received up to that point), with the number of metric clusters generally being closer to k than to the threshold. The pipeline metric outlier detector(s) 3408 may maintain this number of metric clusters even after a merge operation occurs. Thus, the pipeline metric outlier detector(s) 3408 can create metric clusters, assign groups of pipeline metrics corresponding to the same time instant to metric clusters, and merge metric clusters in real-time without being negatively affected by the drawbacks associated with using a traditional clustering algorithm.

As described above, the streaming data processor(s) 308 can launch multiple pipeline metric outlier detectors 3408 if the volume of the ingested data exceeds a threshold and/or the cardinality of the ingested data exceeds a threshold. Typically, systems that process data in batches have a training phase and a scoring phase. In the training phase, a training system can perform multiple passes on stored, known data to generate a model for processing future data. In the scoring phase, a production system can use the model to process ingested data. If the production system fails, the failure does not result in a loss of the model because the model is static. In other words, the production system had not been updating the model based on the ingested data. Rather, the model used by the production system remained in the same state as when the model was generated by the training system. A new production system can be instantiated to replace the failed production system, and the model can simply be exported from the training system to the new production system, allowing data processing to continue without error. When processing data online or in real-time, however, the model is not static. Specifically, when processing data online or in real-time, the data is constantly being streamed to the data ingestion pipeline. As a result, the data ingestion pipeline is continuously processing the streamed data, learning from the data as the data is streamed and updating the model based on the learning. The model, therefore, is not static or a snapshot from a certain moment in time. A failure of a task in the data ingestion pipeline could thus result in a loss of the most-recent model, thereby reducing the accuracy of the data ingestion pipeline processing.
Launching multiple pipeline metric outlier detectors 3408, however, can alleviate these issues, allowing the data ingestion pipeline to constantly learn and be fault tolerant regardless of whether the volume of the ingested data exceeds a threshold and/or the cardinality of the ingested data exceeds a threshold. In fact, launching multiple pipeline metric outlier detectors 3408 in the architecture described herein can allow the data ingestion pipeline to pause and upgrade the data ingestion pipeline logic (e.g., incorporate new clustering algorithms (e.g., to improve cluster accuracy) and/or incorporate new steps in the data ingestion pipeline (e.g., to make the pipeline more efficient)) without causing the data ingestion pipeline to re-learn the model. Rather, the pipeline metric outlier detector(s) 3408 can continue to use the most-recently learned model (e.g., the most-recently learned metric clusters) after the upgraded data ingestion pipeline logic is incorporated and the data ingestion pipeline resumes.

For example, the pipeline metric outlier detector(s) 3408 can be separated into local pipeline metric outlier detectors 3408A-3408D and a global pipeline metric outlier detector 3408N, as shown in FIG. 34C. In other words, the streaming data processor(s) 308 can launch multiple pipeline metric outlier detector 3408 tasks, with some pipeline metric outlier detector 3408 task(s) operating as local task(s) and other pipeline metric outlier detector 3408 task(s) operating as global task(s). The clustering algorithm described herein can be written such that the clustering algorithm can be distributed to the local pipeline metric outlier detectors 3408A-3408D and/or the global pipeline metric outlier detector 3408N such that each pipeline metric outlier detector 3408A-3408D and 3408N can run the clustering algorithm. In addition, the clustering algorithm can be written such that execution of the clustering algorithm is fast (e.g., the number of requests per second that can be processed by the clustering algorithm is high), allowing a larger volume of data to be processed. While FIG. 34C depicts four local pipeline metric outlier detectors 3408A-3408D and one global pipeline metric outlier detector 3408N, this is not meant to be limiting. Any number of local pipeline metric outlier detectors 3408 and/or global pipeline metric outlier detectors 3408 may be launched by the streaming data processor(s) 308.

The streaming data processor(s) 308 can launch one or more sets of pipeline metric outlier detectors 3408A-3408D and 3408N, with each set processing ingested data for a user, a set of users, a device, a set of devices, a certain set of data, and/or the like. Each local pipeline metric outlier detector 3408A-3408D can perform the same operations as described above with respect to the pipeline metric outlier detector(s) 3408. Specifically, a local pipeline metric outlier detector 3408A-3408D can assign a group of pipeline metrics corresponding to the same time instant to an existing metric cluster or a new metric cluster and periodically merge metric clusters in a manner as described above.

The local pipeline metric outlier detectors 3408A-3408D, however, may each receive a different set of data. For example, the volume or cardinality of data may be large such that having one pipeline metric outlier detector 3408A-3408D process all of the data may be too much for the single pipeline metric outlier detector 3408A-3408D to handle in a timely manner. Thus, the stream of ingested data can be broken up into chunks and each local pipeline metric outlier detector 3408A-3408D can process a portion of the stream (e.g., one or more chunks) rather than the entire stream. Specifically, each local pipeline metric outlier detector 3408A-3408D can process a certain portion of the ingested pipeline metrics. Accordingly, as illustrated in FIG. 34C, the local pipeline metric outlier detector 3408A receives ingested pipeline metrics 1, the local pipeline metric outlier detector 3408B receives ingested pipeline metrics 2, the local pipeline metric outlier detector 3408C receives ingested pipeline metrics 3, and the local pipeline metric outlier detector 3408D receives ingested pipeline metrics 4 as the data is ingested in real-time.

Because the local pipeline metric outlier detectors 3408A-3408D each receive a different set of data, the metric clusters created by each local pipeline metric outlier detector 3408A-3408D may be different. In fact, the number of metric clusters created by each local pipeline metric outlier detector 3408A-3408D at any given time may be different given that the merge operations periodically performed by the local pipeline metric outlier detectors 3408A-3408D may result in different levels of metric cluster consolidation. As a result, the local pipeline metric outlier detector 3408A may create a first metric cluster set, the local pipeline metric outlier detector 3408B may create a second metric cluster set, the local pipeline metric outlier detector 3408C may create a third metric cluster set, and the local pipeline metric outlier detector 3408D may create a fourth metric cluster set.

As described above, each local pipeline metric outlier detector 3408A-3408D does not process each ingested piece of data. Rather, each local pipeline metric outlier detector 3408A-3408D processes a portion thereof. Thus, periodically, when a certain volume of data has been processed, or when the number of metric clusters created by any or all of the local pipeline metric outlier detectors 3408A-3408D reaches a threshold (e.g., a threshold on the order of k log₁₀ n), the global pipeline metric outlier detector 3408N can merge the metric clusters created by the individual local pipeline metric outlier detectors 3408A-3408D to create a merged metric cluster set that is based on all of the ingested data to that point. For example, the global pipeline metric outlier detector 3408N can use a clustering algorithm (e.g., k-means++) to merge the first, second, third, and fourth metric cluster sets, treating each metric cluster in the sets as a point to cluster, in a manner as described above to create the merged metric cluster set. The merged metric cluster set may incorporate characteristics learned from all of the data ingested to that point rather than just a subset of the data ingested to that point and processed by an individual local pipeline metric outlier detector 3408A-3408D, as is true with the first, second, third, and fourth metric cluster sets. The global pipeline metric outlier detector 3408N can then feed the merged metric cluster set back to the individual local pipeline metric outlier detectors 3408A-3408D so that the individual local pipeline metric outlier detectors 3408A-3408D can continue to process ingested data (e.g., assign groups of pipeline metrics corresponding to the same time instant to metric clusters and/or merge metric clusters) using the merged metric cluster set rather than the metric cluster set originally created by the individual local pipeline metric outlier detector 3408A-3408D.
As the local pipeline metric outlier detectors 3408A-3408D process newly ingested data (e.g., assign groups of pipeline metrics corresponding to the same time instant to metric clusters and/or merge metric clusters) using the merged metric cluster set, each local pipeline metric outlier detector 3408A-3408D may modify the merged metric cluster set in different ways. However, the global pipeline metric outlier detector 3408N can subsequently merge these modified metric cluster sets and provide this most-recently merged metric cluster set to the local pipeline metric outlier detector(s) 3408A-3408D for use in processing data ingested in the future (e.g., for use in assigning groups of pipeline metrics corresponding to the same time instant to metric clusters and/or merging metric clusters), and the cycle can continue. Thus, the architecture described herein includes nested merge operations, where the local pipeline metric outlier detectors 3408A-3408D may each regularly perform merge operations on their own metric cluster sets in a manner as described herein, and then the global pipeline metric outlier detector 3408N can perform a merge operation on the metric cluster sets created by the local pipeline metric outlier detectors 3408A-3408D periodically, when a certain volume of data has been processed, or when the number of metric clusters created by any or all of the local pipeline metric outlier detectors 3408A-3408D reaches a threshold. Alternatively, one or more of the local pipeline metric outlier detectors 3408A-3408D can merge the metric cluster sets created by the local pipeline metric outlier detectors 3408A-3408D rather than the global pipeline metric outlier detector 3408N (thereby resulting in the streaming data processor(s) 308 declining to launch the global pipeline metric outlier detector 3408N).

Thus, the feedback architecture described herein ensures that the pipeline metric outlier detector(s) 3408A-3408D and 3408N are constantly learning and producing updated or merged metric cluster sets. In fact, use of the local pipeline metric outlier detector(s) 3408A-3408D further increases fault tolerance and allows for the data ingestion pipeline logic to be upgraded without disruption to the data ingestion pipeline itself. For example, each algorithm implemented by and/or each model (e.g., metric cluster set) created by the local pipeline metric outlier detector(s) 3408A-3408D and/or the global pipeline metric outlier detector 3408N can be converted into, mapped to, and/or backed up by a FLINK operator (e.g., a stateful FLINK operator). Converting, mapping, or backing up the algorithms into FLINK operators can allow the algorithms to run on local tasks (e.g., the local pipeline metric outlier detectors 3408A-3408D). The FLINK operator (e.g., the stateful FLINK operator) may periodically store its state in a keyed state store. If a local pipeline metric outlier detector 3408A-3408D fails, the streaming data processor(s) 308 can simply launch a new local pipeline metric outlier detector 3408A-3408D to replace the failed local pipeline metric outlier detector 3408A-3408D and retrieve the FLINK operator corresponding to the failed local pipeline metric outlier detector 3408A-3408D from the keyed state store such that the algorithm and/or model (e.g., metric cluster set) represented by the FLINK operator can be applied to the new local pipeline metric outlier detector 3408A-3408D. In other words, the streaming data processor(s) 308 can recreate the failed local pipeline metric outlier detector 3408A-3408D using the FLINK operator stored in the keyed state store.
Applying the algorithm and/or model represented by the FLINK operator to the new local pipeline metric outlier detector 3408A-3408D allows the new local pipeline metric outlier detector 3408A-3408D to operate using the backed up algorithm and/or model (e.g., metric cluster set), thereby allowing the data ingestion pipeline to continue operations without losing the state of the failed local pipeline metric outlier detector 3408A-3408D.
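The checkpoint-and-recover behavior can be sketched in simplified form. The following is not the FLINK API; it is an in-memory stand-in, and the names KeyedStateStore, LocalDetector, checkpoint, and recover are hypothetical. It only illustrates the idea that a replacement task can resume from the last stored metric cluster set.

```python
import copy


class KeyedStateStore:
    """In-memory stand-in for a keyed state store (illustrative only)."""

    def __init__(self):
        self._snapshots = {}

    def save(self, key, state):
        # Deep copy so later mutations do not alter the snapshot.
        self._snapshots[key] = copy.deepcopy(state)

    def load(self, key):
        return copy.deepcopy(self._snapshots[key])


class LocalDetector:
    """Hypothetical local detector holding a metric cluster set."""

    def __init__(self, key, clusters=None):
        self.key = key
        self.clusters = clusters if clusters is not None else []

    def checkpoint(self, store):
        store.save(self.key, self.clusters)

    @classmethod
    def recover(cls, key, store):
        # Launch a replacement that starts from the backed-up model.
        return cls(key, store.load(key))
```

In this sketch, any learning that occurs after the most recent checkpoint is lost on failure, so the checkpoint interval bounds how much of the model may need to be re-learned.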

As another example, the FLINK operator may have a migration policy that the streaming data processor(s) 308 can use to determine whether upgraded data ingestion pipeline logic (e.g., to replace or upgrade the algorithm) is compatible with the models (e.g., metric clusters) created by the local pipeline metric outlier detector(s) 3408A-3408D (e.g., to determine whether upgraded data ingestion pipeline logic can read the models). If the streaming data processor(s) 308 determine that the upgraded data ingestion pipeline logic is compatible with the models (e.g., metric clusters), the streaming data processor(s) 308 can pause and/or refresh the data ingestion pipeline to incorporate the upgraded data ingestion pipeline logic (which can include a new FLINK operator representing a new algorithm, a new pipeline step, etc.). The streaming data processor(s) 308 can then resume the data ingestion pipeline from the previous state, using the previously learned models (e.g., the most recent set of metric clusters) and the upgraded data ingestion pipeline logic (e.g., the new or upgraded clustering algorithm) to process ingested data (e.g., pipeline metrics). Thus, the models do not need to be re-learned when the data ingestion pipeline logic is upgraded.

4.15.1.4. Explaining Anomalies in Pipeline Metrics

The anomalous metric identifier 3410 can be configured to provide explanations for anomalies detected in pipeline metrics based on patterns observed in logs, such as job manager logs, task manager logs, and/or other type(s) of application logs. Specifically, the anomalous metric identifier 3410 can correlate logs with metric outliers and use the logs as a root cause analysis for explaining why a metric is observed as an outlier.

For example, the pipeline metric outlier detector(s) 3408 can assign each group of pipeline metrics corresponding to the same time instant an outlier score. If the outlier score exceeds a threshold, this may indicate that some or all of the pipeline metrics in the group are outliers. Detection of outlier pipeline metrics may indicate that there is an issue with a corresponding portion of the data ingestion pipeline. However, false positives can occur and some detected outliers actually may not indicate any issue with a corresponding portion of the data ingestion pipeline. The anomalous metric identifier 3410 can filter the false positives by observing whether any anomalies are detected in logs or in sequences of logs corresponding to the same time instant or time period as a group of pipeline metrics flagged as being outliers. If an anomaly is detected in a log or in a sequence of logs that corresponds to the same time instant or time period as a group of pipeline metrics flagged as being outliers, this may increase the chances that the pipeline metrics are anomalous and not a false positive, and therefore that there is an issue with the data ingestion pipeline that should be resolved.

As an illustrative example, the anomalous metric identifier 3410 can identify anomalous logs or anomalous sequences of logs based on anomaly information provided by the anomaly detector 3406 (e.g., the anomaly detector 3406 can identify anomalous logs and/or anomalous sequences of logs and provide this information to the anomalous metric identifier 3410). Each anomalous log or anomalous sequence of logs may be associated with a timestamp or range of timestamps and an anomaly score. Specifically, the anomaly score may be assigned by the anomaly detector 3406 or the anomalous metric identifier 3410 and may be a distance between the anomalous log and the data pattern to which the anomalous log is assigned or a distance between the anomalous sequence of logs and the log sequence cluster to which the anomalous sequence of logs is assigned.

The anomalous metric identifier 3410 can, for a group of pipeline metrics corresponding to the same time instant having an outlier score, identify an anomalous log that has a timestamp and/or an anomalous sequence of logs that have a range of timestamps corresponding to the time instant of the group of pipeline metrics (e.g., a timestamp that matches the time instant, a range of timestamps in which the time instant falls, a timestamp that is within a threshold period of time of the time instant (e.g., a timestamp that is within 30 minutes of the time instant), a range of timestamps that have at least one timestamp that is within a threshold period of time of the time instant (e.g., a range of timestamps in which at least one timestamp is within 30 minutes of the time instant), etc.). The anomalous metric identifier 3410 can then calculate a weighted sum of the outlier score, the anomaly score for an anomalous log, and/or the anomaly score for an anomalous sequence of logs. For example, the anomalous metric identifier 3410 can apply a first weight to the outlier score, a second weight to the anomalous log anomaly score, and/or a third weight to the anomalous sequence of logs anomaly score. If the weighted sum exceeds a threshold, then the anomalous metric identifier 3410 determines that the group of pipeline metrics corresponding to the same time instant is anomalous and is not a false positive. Otherwise, if the weighted sum is less than or equal to the threshold, then the anomalous metric identifier 3410 determines that the group of pipeline metrics corresponding to the same time instant is not an outlier or anomalous and/or is a false positive. The anomalous metric identifier 3410 can adjust the weights applied to the different scores over time based on user feedback received as to whether a log is anomalous, a sequence of logs is anomalous, and/or a pipeline metric is an outlier.
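The correlation and weighted-sum test can be sketched as follows. The weights, the threshold value, and the function names are illustrative assumptions; only the 30 minute tolerance is taken from the example above.

```python
from datetime import datetime, timedelta


def is_correlated(metric_time, log_start, log_end,
                  tolerance=timedelta(minutes=30)):
    """True if the metric time instant falls within the log timestamp
    range, widened by the tolerance. For a single log, pass the same
    timestamp as log_start and log_end."""
    return log_start - tolerance <= metric_time <= log_end + tolerance


def is_anomalous(outlier_score, log_score=0.0, sequence_score=0.0,
                 weights=(0.5, 0.25, 0.25), threshold=1.0):
    """Weighted sum of the three scores; anomalous only if the sum
    strictly exceeds the threshold (weights and threshold are assumed)."""
    w_outlier, w_log, w_sequence = weights
    total = (w_outlier * outlier_score
             + w_log * log_score
             + w_sequence * sequence_score)
    return total > threshold
```

A score of zero can be passed for a log or log sequence that has no correlated anomaly, so an outlier score alone must be comparatively large to cross the threshold.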

The anomalous metric identifier 3410 or another component in the data intake and query system 108 can generate user interface data that, when rendered by a client device 204, causes the client device 204 to display a user interface depicting the anomalous group of pipeline metrics corresponding to the same time instant detected by the anomalous metric identifier 3410, along with an explanation of why the group of pipeline metrics corresponding to the same time instant has been flagged as being anomalous. Specifically, the user interface can identify the anomalous log and/or the anomalous sequence of logs that are correlated with the anomalous group of pipeline metrics (e.g., the anomalous log or anomalous sequence of logs that correspond to the same time or time range as the anomalous group of pipeline metrics), and include a visual and/or audible explanation that such anomalies in the logs or sequence of logs may be the cause of the data ingestion pipeline issue indicated by the anomalous group of pipeline metrics. Alternatively or in addition, the anomalous metric identifier 3410 can generate an alert identifying the anomalous group of pipeline metrics and/or the possible cause of the detected anomaly (e.g., an explanation that such anomalies in the correlated logs or sequence of logs may be the cause of the data ingestion pipeline issue indicated by the anomalous group of pipeline metrics).

4.15.2. Data Pattern and Anomaly User Interfaces

FIG. 35 illustrates an example anomaly and pattern workbook view 3500 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook view 3500 depicts various information about anomalies detected by the anomaly detector 3406. In some embodiments, the anomaly and pattern workbook view 3500 includes a list 3501 providing anomaly information and normal event information, a search field 3502, and a histogram 3504.

A user can enter a query in the search field 3502. The query, when entered, may cause the query system 214 to run the query on events corresponding to the time range selected by the user via time field 3503 and produce corresponding query results. The query results may be organized as normal event information or anomalous event information and depicted at least partially in the list 3501.

The histogram 3504 can depict various buckets. Each bucket may correspond to a time period within the selected time range. As an illustrative example, the time range selected via the time field 3503 is a 1 hour time range. Each bucket, therefore, may correspond to a 5 minute time period within the 1 hour time range (e.g., a 5 minute time period between 11:00 AM and 12:00 PM on October 11th), a 6 minute time period within the 1 hour time range, a 10 minute time period within the 1 hour time range, or the like. The height of a bucket may correspond to a number of events corresponding to the time period (e.g., a number of events that occurred during the time period). The histogram 3504 may further include badges tagged to or otherwise associated with a bucket, such as badge 3505, that indicate the number of anomalous events detected by the anomaly detector 3406 that occurred within the time period of the associated bucket.
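The bucketing behind such a histogram can be illustrated with a short sketch. The function name, the dictionary layout, and the (timestamp, is_anomalous) event representation are assumptions made for the example.

```python
import math
from datetime import datetime, timedelta


def build_histogram(events, start, end, bucket_width):
    """Count events and anomalous events per time bucket.

    events: iterable of (timestamp, is_anomalous) pairs.
    Returns one dict per bucket; "events" drives the bucket height and
    "anomalies" is the number a badge on that bucket would display.
    """
    n_buckets = math.ceil((end - start) / bucket_width)
    buckets = [
        {"start": start + i * bucket_width, "events": 0, "anomalies": 0}
        for i in range(n_buckets)
    ]
    for timestamp, is_anomalous in events:
        if not (start <= timestamp < end):
            continue  # outside the selected time range
        bucket = buckets[(timestamp - start) // bucket_width]
        bucket["events"] += 1
        if is_anomalous:
            bucket["anomalies"] += 1
    return buckets
```

With a 1 hour range and a 5 minute bucket width, this produces 12 buckets; a 6 minute or 10 minute width would produce 10 or 6 buckets, respectively.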

A user may expand the list 3501 to show anomaly information and normal event information or contract the list 3501 to hide the anomaly information and normal event information. When expanded, each row in the list 3501 can either depict information for a particular type of anomalous event or information for a particular type of normal event. For example, the information for an anomalous event can include a number of anomalous events detected by the anomaly detector 3406 for the time period selected via time field 3503 that have the same data pattern (e.g., 5 for the first type of anomalous event listed in the list 3501), a histogram 3506 highlighting in which bucket(s) (e.g., in which time periods) the anomalous events of the same data pattern fall, an identification of a data pattern shared by the anomalous events corresponding to the row (e.g., “<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected on rank 0, symbol <*> bit <*>,” as depicted in the first row of the list 3501), and a user-selectable action button in which the user can indicate whether the type of anomalous event is interesting (e.g., potentially an actual anomalous event) or not interesting (e.g., not an actual anomalous event). If the user indicates that the type of anomalous event is interesting or not interesting, the selection made by the user can be submitted from the client device 204 to the anomaly detector 3406. The anomaly detector 3406 can then use this user feedback to improve future anomaly detections.

Alternatively, instead of depicting the histogram 3506, the anomaly and pattern workbook view 3500 can depict a box chart, such as a box and whisker chart, that illustrates a range of token values that are considered normal and a range of token values that are considered abnormal or anomalous (e.g., those token values that fall outside of the whisker portion of the box and whisker chart). Given that the anomaly and pattern workbook view 3500 has a finite amount of space, the box chart may initially show a range of normal values and/or identify the positions of values considered anomalous. Upon the user selecting the box chart, a larger box chart may appear in the anomaly and pattern workbook view 3500 (e.g., in a pop-up window) that shows the full range of normal values and anomalous values. In further embodiments, the information for an anomalous event can include other statistics, such as average token values, median token values, mode token values, the standard deviation of token values, the variance of token values, and/or the like.
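One way to picture token values falling outside the whisker portion is sketched below. The description does not state how whisker positions are computed; the sketch assumes the common convention of placing whiskers 1.5 interquartile ranges beyond the first and third quartiles, and the function names are hypothetical.

```python
import statistics


def whisker_bounds(values, k=1.5):
    """Return (low, high) whisker positions.

    Assumption: whiskers sit k * IQR beyond the quartiles, a common
    convention that is not stated in the description.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_by_whiskers(values, k=1.5):
    """Split token values into (normal, anomalous) using the whiskers."""
    low, high = whisker_bounds(values, k)
    normal = [v for v in values if low <= v <= high]
    anomalous = [v for v in values if v < low or v > high]
    return normal, anomalous


# Hypothetical numerical token values for one wildcard of a data pattern.
token_values = [10, 11, 12, 13, 14, 15, 16, 100]
normal, anomalous = split_by_whiskers(token_values)
```

In this sample, only the value 100 lies beyond the upper whisker, so it is the only value placed in `anomalous`.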

As another alternative, instead of depicting the histogram 3506, the anomaly and pattern workbook view 3500 can depict a distribution graph showing the distribution of token values that are considered normal. Selection of the distribution graph may cause the anomaly and pattern workbook view 3500 to depict (e.g., in a pop-up window) a larger distribution graph showing the distribution of token values that are considered normal and the token values that are considered abnormal or anomalous.

In some embodiments, if the anomaly detector 3406 flags an event as potentially being anomalous because the data pattern assigned to the event is potentially anomalous, the list 3501 can further include a badge indicating that the type of anomalous event has been flagged because the pattern is new and potentially anomalous. For example, as illustrated in FIG. 35, the last type of anomalous event included in the list 3501 includes a badge 3507 indicating that the type of anomalous event has been flagged as being anomalous because the data pattern assigned to the type of event is new and may be anomalous. If this type of badge, such as the badge 3507, is not present in a row, this may indicate that the anomaly detector 3406 flagged the type of event as potentially being anomalous because at least one of the token values of the event may be anomalous.

A user can further filter the types of anomalous events shown to just those corresponding to a particular bucket or set of buckets in the histogram 3504. For example, each of the buckets in the histogram 3504 may be selectable. Selection of bucket 3510, for example, may cause the list 3501 to update to only show some or all of the six anomalies that correspond to the bucket 3510. If the user then selects bucket 3511, for example, then the list 3501 may be updated to show only some or all of the six anomalies that correspond to the bucket 3510 and/or some or all of the four anomalies corresponding to the bucket 3511. Another selection of the bucket 3510, however, may cause the list 3501 to be updated again to show only some or all of the four anomalies corresponding to the bucket 3511.
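The selection behavior described above resembles a toggle on a set of selected buckets. A minimal sketch follows; the function names, the anomaly identifiers, and the choice to show everything when no bucket is selected are assumptions made for illustration.

```python
def toggle_bucket(selected, bucket_id):
    """Return a new selection with bucket_id added if absent, removed if present."""
    return frozenset(selected) ^ {bucket_id}


def visible_anomalies(anomalies_by_bucket, selected):
    """List the anomalies for the selected buckets, in bucket order.

    Assumption: an empty selection means no filter is applied.
    """
    buckets = sorted(selected) if selected else sorted(anomalies_by_bucket)
    return [a for b in buckets for a in anomalies_by_bucket.get(b, [])]


# Hypothetical data keyed by the bucket reference numerals used in the text.
anomalies_by_bucket = {
    3510: [f"anomaly-3510-{i}" for i in range(6)],
    3511: [f"anomaly-3511-{i}" for i in range(4)],
}

selection = toggle_bucket(frozenset(), 3510)  # only bucket 3510 selected
selection = toggle_bucket(selection, 3511)    # buckets 3510 and 3511 selected
selection = toggle_bucket(selection, 3510)    # bucket 3510 deselected
rows = visible_anomalies(anomalies_by_bucket, selection)
```

After the three selections, `rows` holds only the four anomalies of bucket 3511, mirroring the sequence in the example.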

By grouping anomalous events that share a data pattern, the anomaly and pattern workbook view 3500 can compress additional data into the finite amount of space available on a screen. In fact, via this grouping, the anomaly and pattern workbook view 3500 can refrain from showing information about specific anomalous events that are uninteresting to a user. Likewise, via this grouping, the client device 204 can avoid rendering information about specific anomalous events that are uninteresting to a user, thereby allowing the client device 204 to allocate computing resources for other operations.
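The grouping can be pictured as collapsing many events into one row per data pattern. The sketch below is illustrative only; the `(pattern, raw_event)` pair shape, the function names, and the sample events are assumptions.

```python
from collections import defaultdict


def group_by_pattern(events):
    """Map each data pattern to the raw events assigned to it."""
    groups = defaultdict(list)
    for pattern, raw_event in events:
        groups[pattern].append(raw_event)
    return dict(groups)


def summary_rows(groups):
    """Return one (count, pattern) row per pattern, largest count first."""
    return sorted(((len(members), pattern) for pattern, members in groups.items()),
                  reverse=True)


# Hypothetical events, each already assigned to a data pattern.
events = [
    ("<*> login failed for <*>", "10:01 login failed for alice"),
    ("<*> login failed for <*>", "10:02 login failed for bob"),
    ("<*> disk usage <*> percent", "10:03 disk usage 74 percent"),
    ("<*> login failed for <*>", "10:04 login failed for carol"),
]
rows = summary_rows(group_by_pattern(events))
```

Only the two summary rows need to be rendered until a user expands a row to see the individual events.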

In addition, the anomaly and pattern workbook view 3500 includes a raw data/pattern toggle button 3509, which allows a user to toggle between viewing raw, ingested data and the ingested data organized into patterns (as depicted in FIG. 35). Thus, a user can switch between viewing the raw ingested data and the ingested data organized into patterns within the same view 3500 without having to select and view different tabs or windows. Accordingly, the anomaly and pattern workbook view 3500 provides a single interface that depicts multiple types of information within the same window, reducing the number of navigational steps that a user may have to perform to view such information.

If a user elects to expand one of the rows in the list 3501, the anomaly and pattern workbook view 3500 can be updated to show the specific anomalous events corresponding to the row (e.g., the specific anomalous events that each share the same data pattern). For example, FIG. 36 illustrates an example anomaly and pattern workbook view 3600 rendered and displayed by the client browser 204 in which the user has elected to expand carrot 3508 to show the specific anomalous events corresponding to the first row in the list 3501.

As described herein, a data pattern can include zero or more wildcards that represent various token values. When the carrot 3508 is expanded, however, the list 3501 may be updated to include additional sub-rows, where each sub-row shows an anomalous event assigned to the same data pattern, including the individual token values of the anomalous event represented by the wildcard(s) in the data pattern.
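To illustrate how a sub-row could recover the token values that the wildcards stand for, the following sketch aligns an event with its data pattern position by position. Whitespace tokenization, the function name, and the sample event are assumptions; only the pattern string is taken from the description.

```python
WILDCARD = "<*>"


def wildcard_values(pattern_tokens, event_tokens):
    """Return the event's token values at the pattern's wildcard positions."""
    if len(pattern_tokens) != len(event_tokens):
        raise ValueError("event and pattern must have the same number of tokens")
    return [e for p, e in zip(pattern_tokens, event_tokens) if p == WILDCARD]


pattern = ("<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected "
           "on rank 0, symbol <*> bit <*>").split()
# Hypothetical event text; the values are invented for this example.
event = ("2005-06-03 RAS KERNEL INFO 3 ddr error(s) detected and corrected "
         "on rank 0, symbol 12 bit 4").split()

values = wildcard_values(pattern, event)
```

Here `values` holds the four token values hidden behind the four wildcards, which is the information a sub-row would add to the collapsed pattern row.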

In some embodiments, each sub-row also includes additional actions that may be selected by a user. For example, the user can select to view events surrounding the subject anomalous event and/or to indicate whether the event is actually anomalous. If the user indicates that the event is or is not anomalous, the selection made by the user can be submitted from the client device 204 to the anomaly detector 3406. The anomaly detector 3406 can then use this user feedback to improve future anomaly detections.

If a user elects to view events surrounding the subject anomalous event, the anomaly and pattern workbook view 3600 can be updated to show events that occurred before and/or after the subject anomalous event. For example, FIG. 37 illustrates an example anomaly and pattern workbook view 3700 rendered and displayed by the client browser 204 in which the user has elected to view events surrounding a particular anomalous event. In response to this selection, a pop-up window 3701 may appear in the anomaly and pattern workbook view 3700 in which a series of events are depicted in chronological order. The anomalous event for which a user is attempting to view surrounding events may be depicted near or at the center of the pop-up window 3701, events that occurred before the anomalous event may be listed above the anomalous event, and events that occurred after the anomalous event may be listed below the anomalous event.

In some embodiments, the user can adjust the time period during which events that occurred are surfaced and depicted in the pop-up window 3701. For example, a user can adjust the time period via time field 3702. Thus, if, as depicted in FIG. 37, the user selects a time period of +/−1 minute, then some or all of the events that occurred 1 minute before the anomalous event may be listed above the anomalous event and some or all of the events that occurred 1 minute after the anomalous event may be listed below the anomalous event.
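As a rough sketch of the surrounding-event selection, the function below returns the events within a symmetric window around an anomalous event, split into those to list above and those to list below. The `(timestamp, text)` event shape, the function name, and the sample data are assumptions.

```python
from datetime import datetime, timedelta


def surrounding_events(events, anomaly_time, window):
    """Return (before, after) lists of events within +/- window of anomaly_time.

    `events` is a list of (timestamp, text) pairs; both lists are returned
    in chronological order.
    """
    before = [e for e in events if anomaly_time - window <= e[0] < anomaly_time]
    after = [e for e in events if anomaly_time < e[0] <= anomaly_time + window]
    return (sorted(before, key=lambda e: e[0]),
            sorted(after, key=lambda e: e[0]))


# Hypothetical events around an anomaly at 11:30:00, with a +/-1 minute window.
anomaly_time = datetime(2019, 10, 11, 11, 30, 0)
events = [
    (datetime(2019, 10, 11, 11, 28, 0), "too early"),
    (datetime(2019, 10, 11, 11, 29, 30), "just before"),
    (datetime(2019, 10, 11, 11, 30, 20), "just after"),
    (datetime(2019, 10, 11, 11, 31, 30), "too late"),
]
before, after = surrounding_events(events, anomaly_time, timedelta(minutes=1))
```

Widening `window` corresponds to the user choosing a larger time period in the time field.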

As with the specific anomalous events listed in the sub-row of the anomaly and pattern workbook view 3600, a user may be able to indicate whether the anomalous event is actually anomalous and/or whether the surrounding events are actually anomalous via the pop-up window 3701. If the user indicates that any event is or is not anomalous, the selection made by the user can be submitted from the client device 204 to the anomaly detector 3406. The anomaly detector 3406 can then use this user feedback to improve future anomaly detections.

As described above, the list 3501 provides anomalous event information and normal event information. For example, FIG. 38 illustrates an example anomaly and pattern workbook view 3800 rendered and displayed by the client browser 204 in which the user has hidden the anomalous event information and expanded the normal event information. In particular, the user has contracted carrot 3801 (which, when expanded, shows anomalous event information) and expanded carrot 3802 to show the normal event information.

In some embodiments, expansion of the carrot 3802 and/or contraction of the carrot 3801 causes the list 3501 to be updated to show the normal event information. As with the anomalous event information, the normal event information can include a number of normal events for the time period selected via the time field 3503 that have the same data pattern (e.g., 200 for the first type of normal event listed in the updated list 3501), a histogram 3806 highlighting in which bucket(s) (e.g., in which time periods) the normal events of the same data pattern fall, an identification of a data pattern shared by the normal events corresponding to the row (e.g., “<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected on rank 0, symbol <*> bit <*>,” as depicted in the first row of the updated list 3501), and user-selectable action buttons with which the user can elect to view events surrounding the normal events and/or indicate whether the type of normal event is or is not anomalous. If the user indicates that the type of normal event is or is not anomalous, the selection made by the user can be submitted from the client device 204 to the anomaly detector 3406. The anomaly detector 3406 can then use this user feedback to improve future anomaly detections.

FIG. 39 illustrates an example pattern catalog view 3900 rendered and displayed by the client browser 204 in which events that match or are otherwise assigned to a certain data pattern are displayed. For example, in response to a data pattern submitted to the query system 214, the query system 214 can use the data store catalog 220 to identify data stored in the common storage 216 that corresponds to the data pattern. In particular, the user can provide the data pattern to identify events that match the user-entered data pattern. The user, however, may not need to submit or enter a query that is processed by the query system 214. Rather, the information displayed in the pattern catalog view 3900 can be presented without a query being entered by the user or auto-generated by the system.

As illustrated in FIG. 39, the user has entered the data pattern “<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected on rank 0, symbol <*> bit <*>” as the data pattern for which events that match or are otherwise assigned to the data pattern are to be displayed. The user (or system) can also select a time range for which events matching or otherwise assigned to the entered data pattern are surfaced (e.g., by the query system 214) and displayed in pop-up window 3901 via time field 3902.

The pop-up window 3901 can display a histogram 3903 indicating the number of events that match or are otherwise assigned to the entered data pattern that occurred at or correspond to a certain time period within the time range selected via the time field 3902. For example, each bar in the histogram 3903 may represent a 1 second time period, a 5 second time period, a 10 second time period, or the like.

The pop-up window 3901 can further display a list 3904 of the specific events that match or are otherwise assigned to the entered data pattern. The list 3904 can include a time at which the event occurred and the specific token values that comprise the event.

FIG. 40 illustrates another example pattern catalog view 4000 rendered and displayed by the client browser 204 in which trends in event occurrences and/or event anomaly detections are displayed. As illustrated in FIG. 40, the user can select a time range for which trend information is to be displayed in pop-up window 4001 via time field 4002. As with the pattern catalog view 3900, the information displayed in the pattern catalog view 4000 can be presented without a query being entered by the user or auto-generated by the system.

The pop-up window 4001 can further include a list 4003 in which trend information is provided. For example, the trend information can include a count of a number of events that match or are otherwise assigned to a particular data pattern, a number of events that match the particular data pattern in which anomalies are detected by the anomaly detector 3406, a percentage change in the number of events that match or are otherwise assigned to the particular data pattern (e.g., as compared to one or more previous time ranges, over time during the selected time range, etc.) and/or the percentage change in the number of anomalous events that match or are otherwise assigned to the particular data pattern (e.g., as compared to one or more previous time ranges, over time during the selected time range, etc.), and an identification of the particular data pattern. Optionally, the list 4003 can include user-selectable action items, such as the ability to indicate whether the data pattern is interesting or not interesting.
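The trend fields listed above could be derived from counts in the selected time range and in one previous time range, as in the sketch below. The function names, the dictionary keys, and the choice to report no percentage when the previous count is zero are assumptions made for illustration.

```python
def percent_change(previous, current):
    """Percentage change from previous to current, or None if previous is 0."""
    if previous == 0:
        return None  # assumption: the change is left undefined in this case
    return (current - previous) / previous * 100.0


def trend_row(pattern, events_now, events_before, anomalies_now, anomalies_before):
    """Build one hypothetical row of trend information for a data pattern."""
    return {
        "pattern": pattern,
        "event_count": events_now,
        "anomaly_count": anomalies_now,
        "event_change_pct": percent_change(events_before, events_now),
        "anomaly_change_pct": percent_change(anomalies_before, anomalies_now),
    }


# Hypothetical counts: 250 events now versus 200 in the previous time range,
# and 5 anomalies now versus none previously.
row = trend_row("<*> disk usage <*> percent", 250, 200, 5, 0)
```

In this sample, the event count rose by 25 percent, while the anomaly percentage is left undefined because the previous range had no anomalies.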

Alternatively or in addition, the pattern catalog view 4000 can include a trendline graph showing the trends of the counts of various data patterns and/or anomalous events within the data patterns over a period of time. For example, the trendline graph can be included in the pop-up window 4001 in place of the list 4003. The trendline graph can include the trends of all data patterns or a subset of the data patterns (e.g., the top 5 data patterns).

FIG. 51 illustrates another example anomaly and pattern workbook view 5100 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook view 5100 depicts various information about anomalies detected by the anomaly detector 3406. In some embodiments, the anomaly and pattern workbook view 5100 includes selectable elements 5109-5111 that allow a user to view information on all events that occurred during the time range selected via the time field 3503, to view anomalies detected during the time range selected via the time field 3503, and/or to view data patterns detected during the time range selected via the time field 3503. The element 5109 may indicate a total number of events that were detected and, when selected, may allow a user to view information on all events. The element 5110 may indicate a total number of anomalies that were detected and a number of data patterns in which anomalies are detected and, when selected, may allow a user to view detected anomalies. The element 5111 may indicate a total detected number of data patterns, a detected number of anomalous data patterns, and a detected number of normal data patterns and, when selected, may allow a user to view detected data patterns.

As illustrated in FIG. 51, the element 5109 is selected, which causes list 5101 to display information about some or all of the events that occurred during the time range selected via the time field 3503. In some implementations, the list 5101 displays, in each row, a time that an event occurred, a data pattern of the event (or the event itself), and user-selectable action buttons with which the user can view surrounding events and/or indicate whether the event is anomalous. Events that are anomalous may also be indicated in the list 5101. For example, events, such as event 5112, may be bolded, colored differently, highlighted, or otherwise marked to indicate that the event is anomalous.

FIGS. 52A-52B illustrate other example anomaly and pattern workbook views 5200 and 5250 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook views 5200 and 5250 depict various information about anomalies detected by the anomaly detector 3406. As illustrated in FIGS. 52A-52B, the element 5110 is selected, which causes list 5201 of the anomaly and pattern workbook views 5200 and 5250 to display information about anomalies detected during the time range selected via the time field 3503.

In some implementations, the list 5201 displays, in each row, a count of a number of anomalies that have been detected in association with the data pattern corresponding to the respective row; a percentage of the events corresponding to the data pattern corresponding to the respective row that are detected to be anomalous; a graph showing a distribution of events corresponding to the data pattern corresponding to the respective row, with an indication of a portion of the graph considered anomalous, if applicable (e.g., the shaded portion of the graph may be considered anomalous); a type of anomalous event or data pattern corresponding to the respective row; and a user-selectable action button with which the user can indicate whether the data pattern is interesting. Wildcards or other portions of a data pattern that correspond to an anomalous token value may be bolded, colored differently, highlighted, or otherwise marked to indicate that the wildcard or data pattern portion corresponds to at least one anomalous token value. For example, row 5212 corresponds to the data pattern “<*> RAS KERNEL INFO <*> ddr error(s) detected and corrected on rank 0, symbol <*> bit <*>.” This data pattern includes several wildcards, but not all of the wildcards correspond to anomalous token values. Rather, wildcards 5213 and 5214 correspond to anomalous token values, whereas the other wildcards of the data pattern do not correspond to any anomalous token values.

As illustrated in FIG. 52A, the graphs included in each row may be distribution graphs showing a distribution of events corresponding to the data pattern corresponding to the respective row, with an indication of a portion of the distribution graph considered anomalous (e.g., the shaded portion of the distribution graph may be considered anomalous). As illustrated in FIG. 52B, the graphs included in each row may be dependent on the type of token values associated with the data pattern of the respective row. For example, a distribution graph may be shown in the row if the type of token values associated with the data pattern is numerical, whereas a histogram may be shown in the row if the type of token values associated with the data pattern is categorical. Other types of graphs may be shown in the row without limitation. In some implementations, the row may indicate a series of graphs that are associated with the data pattern corresponding to the respective row, where each graph corresponds to one of the token values of the data pattern. In particular, any given data pattern might have multiple (same or different) visualizations because of the types of token values corresponding to the data pattern. Thus, a row may display an indication that multiple graphs are present, with the graphs all being distribution graphs (e.g., if the types of token values associated with the data pattern are all numerical), all being histograms (e.g., if the types of token values associated with the data pattern are all categorical), or a combination thereof (e.g., if some token value types associated with the data pattern are numerical, whereas other token value types associated with the data pattern are categorical).
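The dependence of graph type on token value type might be sketched as below. Treating a wildcard as numerical when every one of its observed token values parses as a number is an assumption of this sketch, as are the function names and the wildcard labels.

```python
def is_numerical(values):
    """Return True if every token value parses as a number (an assumed test)."""
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return True


def graph_type(values):
    """Pick a graph type: distribution for numerical, histogram for categorical."""
    return "distribution" if is_numerical(values) else "histogram"


def graphs_for_pattern(values_by_wildcard):
    """Return one graph type per wildcard of a data pattern."""
    return {name: graph_type(vals) for name, vals in values_by_wildcard.items()}


# Hypothetical token values observed for two wildcards of one data pattern.
graphs = graphs_for_pattern({
    "wildcard_1": ["74", "100", "3.5"],        # numerical token values
    "wildcard_2": ["INFO", "FATAL", "INFO"],   # categorical token values
})
```

A row for this pattern would indicate a combination of graphs: a distribution graph for the first wildcard and a histogram for the second.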

In some embodiments, if the anomaly detector 3406 flags an event as potentially being anomalous because the data pattern assigned to the event is potentially anomalous, the list 5201 can further include a badge indicating that the type of anomalous event has been flagged because the data pattern is new and potentially anomalous. For example, as illustrated in FIGS. 52A-52B, the last type of anomalous event included in the list 5201 includes a badge 5207 indicating that the type of anomalous event has been flagged as being anomalous because the data pattern assigned to the type of event is new and may be anomalous. If this type of badge, such as the badge 5207, is not present in a row, this may indicate that the anomaly detector 3406 flagged the type of event as potentially being anomalous because at least one of the token values of the event may be anomalous.

FIGS. 53A-53B illustrate other example anomaly and pattern workbook views 5300 and 5350 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook views 5300 and 5350 depict various information about anomalies detected by the anomaly detector 3406. As illustrated in FIGS. 53A-53B, the element 5111 is selected, which causes list 5301 of the anomaly and pattern workbook views 5300 and 5350 to display information about data patterns detected during the time range selected via the time field 3503.

In some implementations, the list 5301 displays, in each row, a count of a number of times a data pattern corresponding to the respective row has been detected; a percentage of all of the times a data pattern is detected during the time range selected via the time field 3503 that match the data pattern of the respective row; a graph showing a distribution of events corresponding to the data pattern corresponding to the respective row, optionally with an indication of a portion of the graph considered anomalous, if applicable (e.g., the shaded portion of the graph may be considered anomalous); a data pattern corresponding to the respective row; and a user-selectable action button with which the user can indicate whether the pattern is interesting. Wildcards of a data pattern may be bolded, colored differently, highlighted, or otherwise marked to indicate that multiple token values correspond to the wildcard.

As illustrated in FIG. 53A, the graphs included in each row may be distribution graphs showing a distribution of events corresponding to the data pattern corresponding to the respective row. As illustrated in FIG. 53B, the graphs included in each row may be dependent on the type of token values associated with the data pattern of the respective row. For example, a distribution graph may be shown in the row if the type of token values associated with the data pattern is numerical, whereas a histogram may be shown in the row if the type of token values associated with the data pattern is categorical. Other types of graphs may be shown in the row without limitation. In some implementations, the row may indicate a series of graphs that are associated with the data pattern corresponding to the respective row, where each graph corresponds to one of the token values of the data pattern. In particular, any given data pattern might have multiple (same or different) visualizations because of the types of token values corresponding to the data pattern. Thus, a row may display an indication that multiple graphs are present, with the graphs all being distribution graphs (e.g., if the types of token values associated with the data pattern are all numerical), all being histograms (e.g., if the types of token values associated with the data pattern are all categorical), or a combination thereof (e.g., if some token value types associated with the data pattern are numerical, whereas other token value types associated with the data pattern are categorical).

FIGS. 54A-54B illustrate other example anomaly and pattern workbook views 5400 and 5450 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook views 5400 and 5450 depict various information about anomalies detected by the anomaly detector 3406. As illustrated in FIGS. 54A-54B, the element 5110 is selected. In addition, bucket 3510 in the histogram 3504 is selected. As a result, list 5401 of the anomaly and pattern workbook views 5400 and 5450 displays information about detected anomalies corresponding to the bucket 3510 (e.g., anomalies detected during a portion of the time range selected via the time field 3503 corresponding to the bucket 3510).

Upon selection of the bucket 3510, the element 5109 may update to indicate the number of total events that were detected or that occurred during a portion of the time range selected via the time field 3503 corresponding to the bucket 3510, the element 5110 may update to indicate the number of anomalies that were detected during a portion of the time range selected via the time field 3503 corresponding to the bucket 3510, and the element 5111 may update to indicate the number of patterns that were detected during a portion of the time range selected via the time field 3503 corresponding to the bucket 3510.

A row in the list 5401 can be selected to show additional information about the corresponding anomaly. FIGS. 55A-55B illustrate other example anomaly and pattern workbook views 5500 and 5550 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook views 5500 and 5550 depict various information about anomalies detected by the anomaly detector 3406 during the time range corresponding to the bucket 3510. As illustrated in FIGS. 55A-55B, row 5412 is selected, which causes the list 5401 to show specific events 5501 that match the data pattern of the row 5412. In particular, each of the events 5501 includes the token values that correspond to the wildcards of the data pattern of the row 5412.

FIGS. 56-58 illustrate other example anomaly and pattern workbook views 5600, 5700, and 5800 rendered and displayed by the client browser 204 in which the anomaly and pattern workbook views 5600, 5700, and 5800 depict more detailed information about anomalies detected by the anomaly detector 3406. As illustrated in FIG. 56, a user may select a data pattern or specific event from any of the anomaly and pattern workbook views described herein. In response, the anomaly and pattern workbook view 5600 may display a pop-up window 5601 identifying the selected data pattern.

Some or all of the wildcards of the pattern identified in the pop-up window 5601 may be selectable. In addition, the wildcards may be bolded, colored differently, highlighted, or otherwise marked to indicate which wildcards correspond to anomalous token values and which do not correspond to anomalous token values. For example, wildcard 5602 of the data pattern may be selected. The wildcard 5602 may correspond to types of token values that are numerical. As a result, the pop-up window 5601 may display a distribution graph 5603 and properties of the distribution of the token values corresponding to the selected wildcard 5602. For example, the properties can include median token values corresponding to the selected wildcard 5602, minimum and/or maximum token values corresponding to the selected wildcard 5602, a standard deviation of token values corresponding to the selected wildcard 5602, an average token value corresponding to the selected wildcard 5602, a mode of the token values corresponding to the selected wildcard 5602, and/or a number of anomalous token values corresponding to the selected wildcard 5602.
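The listed properties can be computed with standard summary statistics, as sketched below. The function name and dictionary keys are hypothetical, the sample standard deviation is an assumed choice (the description does not say whether a sample or population value is intended), and the `is_anomalous` predicate merely stands in for whatever determination the anomaly detector 3406 makes.

```python
import statistics


def distribution_properties(values, is_anomalous=lambda v: False):
    """Summarize the numerical token values observed for one wildcard.

    `is_anomalous` is a hypothetical predicate supplied by the caller.
    At least two values are required for the sample standard deviation.
    """
    return {
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "mean": statistics.fmean(values),
        "mode": statistics.mode(values),
        "anomalous_count": sum(1 for v in values if is_anomalous(v)),
    }


# Hypothetical token values; values above 10 are treated as anomalous here.
props = distribution_properties([1, 2, 2, 3, 12], is_anomalous=lambda v: v > 10)
```

For this sample the median and mode are both 2 and the mean is 4.0, and one value is counted as anomalous.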

The distribution graph 5603 may indicate visually where the median token value falls on the distribution and a portion 5604 of the distribution graph 5603 in which anomalous token values fall (e.g., represented by markers 5605-5607). List 5608 may further indicate specific events that include anomalous token values corresponding to the selected wildcard 5602 and/or that do not include anomalous token values corresponding to the selected wildcard 5602. The token values may be bolded, colored differently, highlighted, or otherwise marked to indicate which token values correspond to the selected wildcard 5602.

As illustrated in FIG. 57, a user may select a different wildcard 5702 from the data pattern identified in the pop-up window 5601. The wildcard 5702 may correspond to types of token values that are numerical. As a result, the pop-up window 5601 may display a distribution graph 5703 and properties of the distribution of the token values corresponding to the selected wildcard 5702.

The distribution graph 5703 may indicate visually where the median token value falls on the distribution and a portion 5704 of the distribution graph 5703 in which anomalous token values fall (e.g., represented by markers 5705 and 5706). The list 5608 may further be updated to indicate specific events that include anomalous token values corresponding to the selected wildcard 5702 and/or that do not include anomalous token values corresponding to the selected wildcard 5702. The token values may be bolded, colored differently, highlighted, or otherwise marked to indicate which token values correspond to the selected wildcard 5702.

As illustrated in FIG. 58, a user may select a different data pattern, which causes pop-up window 5801 to appear. The user may further select wildcard 5802 from the data pattern identified in the pop-up window 5801. The wildcard 5802 may correspond to types of token values that are categorical. As a result, the pop-up window 5801 may display a histogram 5803 and properties of the histogram, such as the number of anomalies corresponding to the selected wildcard 5802. If the selected wildcard 5802 corresponds to at least one anomalous token value, then one or more buckets of the histogram 5803 corresponding to the anomalous token value(s) may be shaded, colored differently, highlighted, or otherwise marked to indicate which bucket(s) correspond to anomalous token value(s). In FIG. 58, no anomalous token values correspond to the selected wildcard 5802, and therefore no buckets in histogram 5803 are so marked.

List 5808 may indicate specific events that include anomalous token values corresponding to the selected wildcard 5802 and/or that do not include anomalous token values corresponding to the selected wildcard 5802. The token values may be bolded, colored differently, highlighted, or otherwise marked to indicate which token values correspond to the selected wildcard 5802.

If a user elects to view events surrounding the subject anomalous event, any of the anomaly and pattern workbook views described herein can be updated to show events that occurred before and/or after the subject anomalous event. For example, FIG. 59 illustrates an example anomaly and pattern workbook view 5900 rendered and displayed by the client browser 204 in which the user has elected to view events surrounding a particular anomalous event. In response to this selection, a pop-up window 5901 may appear in the anomaly and pattern workbook view 5900 in which a series of events are depicted in chronological order. The anomalous event for which a user is attempting to view surrounding events may be depicted near or at the center of the pop-up window 5901, events that occurred before the anomalous event may be listed above the anomalous event, and events that occurred after the anomalous event may be listed below the anomalous event.

In some embodiments, the user can adjust the time period during which events that occurred are surfaced and depicted in the pop-up window 5901. For example, a user can adjust the time period via time field 5902. Thus, if, as depicted in FIG. 59, the user selects a time period of +/−1 minute, then some or all of the events that occurred 1 minute before the anomalous event may be listed above the anomalous event and some or all of the events that occurred 1 minute after the anomalous event may be listed below the anomalous event.

A user may be able to indicate whether the anomalous event is actually anomalous and/or whether the surrounding events are actually anomalous via the pop-up window 5901. If the user indicates that any event is or is not anomalous, the selection made by the user can be submitted from the client device 204 to the anomaly detector 3406. The anomaly detector 3406 can then use this user feedback to improve future anomaly detections. A user may also be able to see a graph (e.g., a distribution graph, histogram, etc.) corresponding to the event that may differ based on the types of token values that comprise the event.

4.15.3. Anomalous Log Detection Routines

FIG. 41 is a flow diagram illustrative of an embodiment of a routine 4100 implemented by the streaming data processor 308 to detect an anomalous log. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4100 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4102, one or more tokens are extracted from raw machine data. For example, the one or more tokens can be comprised within a vector (e.g., a string vector). The raw machine data can be job manager and/or task manager logs and/or other type(s) of application logs that are ingested and parsed to identify delimiters in the data. The delimiters may be considered to separate tokens, and the individual tokens can be extracted and inserted as elements of a comparable data structure (e.g., a vector, such as a string vector).
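As a non-limiting illustration, the token extraction of block 4102 can be sketched as follows. The delimiter set `DELIMITERS` and the function name `extract_tokens` are assumptions chosen for this example; the description does not prescribe particular delimiters.

```python
import re

# Assumed delimiter set: whitespace, comma, semicolon, equals, colon, brackets.
DELIMITERS = r"[\s,;=:\[\]]+"

def extract_tokens(raw_line, delimiters=DELIMITERS):
    """Split a raw log line on delimiters and return a string vector."""
    return [tok for tok in re.split(delimiters, raw_line.strip()) if tok]
```

For instance, under these assumed delimiters, the line `INFO job=42 status=OK took 74 ms` would yield an eight-element string vector.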

At block 4104, the one or more tokens are compared to one or more patterns. For example, the pattern matcher(s) 3404 can identify the length of the string vector (e.g., identify the number of elements or tokens that comprise the string vector) and identify zero or more data patterns that have the same length as the string vector. The pattern matcher(s) 3404 can then compare the string vector to just those data patterns having the same length. The comparison can include identifying whether the first token of the string vector matches the first token of a data pattern, whether the second token of the string vector matches the second token of a data pattern, and so on.

At block 4106, a determination is made that the one or more tokens correspond to a first pattern. For example, the pattern matcher(s) 3404 can determine that the string vector corresponds to the first pattern because the string vector has the highest match rate with the first pattern (e.g., more of the string vector tokens match the first pattern tokens than the tokens of other data patterns).

At block 4108, a determination is made that the one or more tokens do not completely match the first pattern. For example, the pattern matcher(s) 3404 may determine that, while the string vector corresponds to the first pattern, the first pattern does not completely describe the string vector. The first pattern may not completely describe the string vector because, for example, one token value of the string vector (e.g., “74”) is not equal to a corresponding token value of the first pattern (e.g., “100”).

At block 4110, the first pattern is updated to include a wildcard. For example, the pattern matcher(s) 3404 can update the first pattern to include a wildcard instead of a token value for the token value that does not match the corresponding token value of the string vector. In this way, the first pattern can be updated to include a wildcard so that the first pattern now completely describes the string vector.
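As a non-limiting illustration, the wildcard update of blocks 4108 and 4110 can be sketched as follows. The wildcard symbol `"*"` and the function names `describes` and `generalize` are assumptions made for this example.

```python
WILDCARD = "*"  # assumed wildcard representation

def describes(pattern, vector):
    """True if the pattern completely describes the string vector."""
    return len(pattern) == len(vector) and all(
        p == WILDCARD or p == v for p, v in zip(pattern, vector)
    )

def generalize(pattern, vector):
    """Replace each non-matching token value in the pattern with a wildcard."""
    return [p if p == WILDCARD or p == v else WILDCARD
            for p, v in zip(pattern, vector)]
```

Using the token values from the example above, a pattern ending in “100” compared against a string vector ending in “74” would be updated so that its final token is a wildcard, after which the pattern describes both.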

At block 4112, a first token of the first pattern is analyzed to determine percentiles of values. In other words, the first token of the first pattern can be analyzed to determine a distribution of values corresponding to the first token. For example, the first token of the first pattern may be a wildcard. The anomaly detector 3406 can identify all of the token values that are represented by the wildcard, and determine the percentiles of these token values or other statistics.

At block 4114, an anomaly value is detected based on values that fall below or above a threshold percentile. For example, the anomaly detector 3406 can determine that a comparable data structure that has a token value corresponding to the first token of the first pattern that falls below a certain percentile or that falls above a certain percentile may be anomalous. As a result, the comparable data structure can be flagged as being anomalous for having at least one token value that appears to be anomalous. A user can subsequently confirm whether the detected anomalous token value is actually anomalous to improve future anomaly detections.
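As a non-limiting illustration, the percentile check of blocks 4112 and 4114 can be sketched as follows. The example assumes the token values represented by the wildcard are numeric, uses the nearest-rank definition of a percentile (one of several common definitions), and uses assumed 1st and 99th percentile thresholds; none of these choices are prescribed by the description.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def is_anomalous(value, observed, low_pct=1, high_pct=99):
    """Flag a token value that falls below low_pct or above high_pct."""
    return (value < percentile(observed, low_pct)
            or value > percentile(observed, high_pct))
```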

Fewer, more, or different blocks can be used as part of the routine 4100. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 41 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 42 is a flow diagram illustrative of an embodiment of a routine 4200 implemented by the streaming data processor 308 to determine whether a comparable data structure should be assigned to a data pattern. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4200 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4202, the number of tokens in a vector that match tokens of a first pattern is counted. For example, the pattern matcher(s) 3404 can walk through a string vector, token by token, and compare each token to the corresponding token in the first pattern. A token in the string vector matches a token in the first pattern if the token values are equal or if the token value in the first pattern is a wildcard.

At block 4204, the number of matching tokens is compared to a threshold. Optionally, the number of matching tokens may be divided by the length of the string vector (or the length of the first pattern) before being compared to the threshold.

At block 4206, a determination is made that the vector corresponds to the first pattern in response to the number of matching tokens satisfying the threshold. For example, the pattern matcher(s) 3404 may determine that the string vector corresponds to the first pattern if the number of matching tokens (or the number of matching tokens divided by the length of the string vector or first pattern) is greater than or equal to the threshold. In further embodiments, the pattern matcher(s) 3404 determines that the string vector corresponds to the first pattern if the number of matching tokens (or the number of matching tokens divided by a length) is greater than or equal to the threshold and is higher than the number of matching tokens (or the number of matching tokens divided by a length) resulting from a comparison with other data patterns.
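As a non-limiting illustration, routine 4200 can be sketched as follows using the length-normalized variant of the comparison. The wildcard symbol, the function names, and the default threshold of 0.5 are assumptions made for this example.

```python
WILDCARD = "*"  # assumed wildcard representation

def match_ratio(vector, pattern):
    """Fraction of tokens that are equal or matched by a wildcard."""
    if not vector or len(vector) != len(pattern):
        return 0.0
    matches = sum(1 for v, p in zip(vector, pattern)
                  if p == WILDCARD or v == p)
    return matches / len(vector)

def assign_pattern(vector, patterns, threshold=0.5):
    """Return the pattern with the highest ratio that meets the threshold."""
    best, best_ratio = None, 0.0
    for pattern in patterns:
        ratio = match_ratio(vector, pattern)
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = pattern, ratio
    return best  # None if no pattern satisfies the threshold
```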

Fewer, more, or different blocks can be used as part of the routine 4200. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 42 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 43 is a flow diagram illustrative of an embodiment of a routine 4300 implemented by the streaming data processor 308 to assign a comparable data structure to a data pattern in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4300 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4302, one or more tokens are extracted from raw machine data. For example, the one or more tokens can be comprised within a vector (e.g., a string vector). The raw machine data can be ingested and parsed to identify delimiters in the data. The delimiters may be considered to separate tokens, and the individual tokens can be extracted and inserted as elements of a comparable data structure (e.g., a vector, such as a string vector).

At block 4304, the one or more tokens are compared to a first set of patterns. For example, the pattern matcher(s) 3404 can identify the length of the string vector (e.g., identify the number of elements or tokens that comprise the string vector) and identify zero or more data patterns in the first set that have the same length as the string vector. The pattern matcher(s) 3404 can then compare the string vector to just those data patterns in the first set having the same length. The comparison can include identifying whether the first token of the string vector matches the first token of a data pattern, whether the second token of the string vector matches the second token of a data pattern, and so on.

At block 4306, the one or more tokens are assigned to a new pattern based on a distance between the one or more tokens and each pattern in the first set being greater than a minimum cluster distance. For example, the minimum cluster distance may be the minimum distance between any two data patterns in the first set. The distance between the one or more tokens and each pattern may be a distance between the vector and a centroid of each pattern.

At block 4308, the minimum cluster distance is updated based on the creation of the new pattern. For example, the new pattern may be associated with the first set of patterns. Thus, the pattern matcher(s) 3404 can determine whether the distance between the new pattern and any of the existing patterns in the first set is less than the minimum cluster distance. If none of the distances between the new pattern and the existing patterns is less than the minimum cluster distance, then the pattern matcher(s) 3404 may keep the minimum cluster distance as the same value. However, if at least one of the distances between the new pattern and the existing patterns is less than the minimum cluster distance, then the minimum cluster distance may be updated by the pattern matcher(s) 3404 to be the lowest of the distances less than the previous minimum cluster distance.

Fewer, more, or different blocks can be used as part of the routine 4300. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 43 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 44 is another flow diagram illustrative of an embodiment of a routine 4400 implemented by the streaming data processor 308 to assign a comparable data structure to a data pattern in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4400 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4402, one or more tokens are extracted from raw machine data. For example, the one or more tokens can be comprised within a vector (e.g., a string vector). The raw machine data can be ingested and parsed to identify delimiters in the data. The delimiters may be considered to separate tokens, and the individual tokens can be extracted and inserted as elements of a comparable data structure (e.g., a vector, such as a string vector).

At block 4404, the one or more tokens are compared to a first set of patterns. For example, the pattern matcher(s) 3404 can identify the length of the string vector (e.g., identify the number of elements or tokens that comprise the string vector) and identify zero or more data patterns in the first set that have the same length as the string vector. The pattern matcher(s) 3404 can then compare the string vector to just those data patterns in the first set having the same length. The comparison can include identifying whether the first token of the string vector matches the first token of a data pattern, whether the second token of the string vector matches the second token of a data pattern, and so on.

At block 4406, the one or more tokens are assigned to a first pattern in the first set based on a distance between the one or more tokens and the first pattern being less than a minimum cluster distance. For example, the minimum cluster distance may be the minimum distance between any two data patterns in the first set. The distance between the vector and the first pattern may be a distance between the vector and a centroid of the first pattern.

At block 4408, a weight and cluster location of the first pattern are updated based on an assignment of the one or more tokens to the first pattern. For example, the weight may represent a count of a number of sets of one or more tokens (e.g., vectors) assigned to the first pattern. Thus, the weight may be incremented by the pattern matcher(s) 3404 by 1. The cluster location may be updated by the pattern matcher(s) 3404 to take into account the location of the one or more tokens (e.g., vector). Thus, locations of all the sets of one or more tokens (e.g., vectors) assigned to the first pattern, including the newly assigned one or more tokens (e.g., vector), can be averaged by the pattern matcher(s) 3404 to determine the updated cluster location of the first pattern.
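As a non-limiting illustration, the averaging described for block 4408 can be computed incrementally, without retaining every previously assigned vector. The symbols below are assumptions introduced for this example: c is the cluster location before the assignment, w is the weight before the assignment, and x is the location of the newly assigned vector.

```latex
\begin{aligned}
w' &= w + 1, \\
c' &= \frac{w\,c + x}{w + 1} \;=\; c + \frac{x - c}{w + 1}.
\end{aligned}
```

Because c is the mean of the w previously assigned locations, w c equals their sum, so c' is the mean of all w + 1 locations.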

At block 4410, the minimum cluster distance is updated based on the updated cluster location of the first pattern. For example, the updated cluster location of the first pattern may mean that the minimum cluster distance has changed. Thus, the pattern matcher(s) 3404 can determine whether the distance between the moved first pattern and the other patterns in the first set is less than the minimum cluster distance. If the minimum cluster distance was not between the first pattern and another pattern in the first set and none of the distances between the moved first pattern and the other patterns in the first set is less than the minimum cluster distance, then the pattern matcher(s) 3404 may keep the minimum cluster distance as the same value. If the minimum cluster distance was between the first pattern and another pattern in the first set, then the pattern matcher(s) 3404 may recalculate some or all of the distances between the patterns in the first set to determine a new minimum cluster distance. However, if at least one of the distances between the first pattern and the other patterns in the first set is less than the minimum cluster distance, then the minimum cluster distance may be updated by the pattern matcher(s) 3404 to be the lowest of the distances less than the previous minimum cluster distance.

Fewer, more, or different blocks can be used as part of the routine 4400. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 44 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 45 is another flow diagram illustrative of an embodiment of a routine 4500 implemented by the streaming data processor 308 to merge data patterns in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4500 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4502, a determination is made that a number of created patterns exceeds a threshold. For example, the threshold may be on the order of k log₁₀ n.

At block 4504, one or more patterns are merged to form a smaller set of patterns. For example, each pattern may be treated as a point to cluster, and a clustering algorithm (e.g., k-means, k-means++, etc.) can be applied to the patterns to merge the patterns into a smaller set of patterns. The pattern matcher(s) 3404 may perform a hierarchical merge such that one or more complete patterns are merged together.

At block 4506, a minimum cluster distance is updated based on the smaller set of patterns. For example, the smaller set of patterns may mean that the previous minimum cluster distance is no longer valid. Thus, the pattern matcher(s) 3404 can determine the distances between each of the patterns in the smaller set to determine the new minimum cluster distance.
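As a non-limiting illustration, one simple form of hierarchical merge is sketched below: the two closest clusters are repeatedly combined into a weighted centroid until a target count is reached, after which the minimum cluster distance is recomputed. This closest-pair merge is a stand-in chosen for brevity and is not the k-means or k-means++ variant mentioned above. The example assumes each pattern has a numeric cluster location, and the function names and the `target` parameter are hypothetical.

```python
import math

def merge_closest(centroids, weights, target):
    """Merge the closest pair of clusters until at most `target` remain."""
    centroids = [list(c) for c in centroids]
    weights = list(weights)
    while len(centroids) > max(target, 1):
        pairs = ((a, b) for a in range(len(centroids))
                 for b in range(a + 1, len(centroids)))
        i, j = min(pairs,
                   key=lambda ab: math.dist(centroids[ab[0]], centroids[ab[1]]))
        total = weights[i] + weights[j]
        # Weighted average keeps the merged location consistent with the counts.
        centroids[i] = [(weights[i] * x + weights[j] * y) / total
                        for x, y in zip(centroids[i], centroids[j])]
        weights[i] = total
        del centroids[j], weights[j]
    return centroids, weights

def min_cluster_distance(centroids):
    """Smallest pairwise distance, or infinity with fewer than two clusters."""
    return min((math.dist(a, b) for i, a in enumerate(centroids)
                for b in centroids[i + 1:]), default=math.inf)
```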

Fewer, more, or different blocks can be used as part of the routine 4500. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 45 can be implemented in a variety of orders, or can be performed concurrently.

4.15.4. Anomalous Pipeline Metric Detection Routines

FIG. 46 is a flow diagram illustrative of an embodiment of a routine 4600 implemented by the streaming data processor 308 to detect an anomalous pipeline metric. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4600 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4602, task manager and job manager logs are joined. For example, each log may include a job ID. The task manager and job manager logs can be joined using the job ID. Specifically, logs that include the same job ID can be joined or merged. In further embodiments, one or more other types of application logs can be joined with or as an alternative to the task manager and/or job manager logs.
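As a non-limiting illustration, the join of block 4602 can be sketched as a grouping of log records by job ID. The dictionary layout of each log and the `job_id` key are assumptions made for this example.

```python
from collections import defaultdict

def join_by_job_id(task_logs, job_logs):
    """Group task manager and job manager logs that share a job ID."""
    joined = defaultdict(lambda: {"task": [], "job": []})
    for log in task_logs:
        joined[log["job_id"]]["task"].append(log)
    for log in job_logs:
        joined[log["job_id"]]["job"].append(log)
    return dict(joined)
```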

At block 4604, a multi-variate time-series outlier detection is performed on pipeline metrics corresponding to a first time to determine an outlier score. For example, the multi-variate time-series outlier detection may indicate a distance between the pipeline metrics corresponding to the first time and a closest metric cluster (e.g., a centroid of a closest metric cluster). The pipeline metric outlier detector(s) 3408 can set the outlier score for the pipeline metrics corresponding to the first time to be this distance.

At block 4606, a data structure corresponding to a first log is parsed to match with a pattern. For example, the pattern matcher(s) 3404 can identify the length of the string vector (e.g., identify the number of elements or tokens that comprise the string vector) and identify zero or more data patterns that have the same length as the string vector. The pattern matcher(s) 3404 can then compare the string vector to just those data patterns having the same length. The comparison can include identifying whether the first token of the string vector matches the first token of a data pattern, whether the second token of the string vector matches the second token of a data pattern, and so on. The pattern matcher(s) 3404 can match the data structure (e.g., string vector) to the pattern based on a determination that the string vector is closest to the pattern.

At block 4608, a determination is made that the first log corresponding to the first time is anomalous based on the pattern. For example, the first log may be anomalous because a token value of the string vector corresponding to the first log is below or above a certain percentile or because a number of string vectors assigned to the pattern is low.

At block 4610, an anomaly score corresponding to the first log is combined with the outlier score to form a combined score. For example, the anomaly score may be a distance between the string vector corresponding to the first log and a closest pattern. The anomaly score and the outlier score can be combined using a weighted sum to form the combined score.

At block 4612, a determination is made that the combined score satisfies a threshold. For example, the combined score may exceed a threshold.

At block 4614, an alert is generated indicating that at least one of the pipeline metrics is anomalous because of an anomaly corresponding to the first log. For example, the combined score satisfying the threshold may cause the anomalous metric identifier 3410 to conclude that the pipeline metrics being outliers is not a false positive.
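As a non-limiting illustration, blocks 4610 through 4614 can be sketched as a weighted sum followed by a threshold comparison. The equal default weights, the parameter names, and the use of a strict "exceeds" comparison are assumptions made for this example.

```python
def combined_score(anomaly_score, outlier_score, w_log=0.5, w_metric=0.5):
    """Weighted sum of the log anomaly score and the metric outlier score."""
    return w_log * anomaly_score + w_metric * outlier_score

def should_alert(anomaly_score, outlier_score, threshold, **weights):
    """True if the combined score exceeds the threshold."""
    return combined_score(anomaly_score, outlier_score, **weights) > threshold
```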

Fewer, more, or different blocks can be used as part of the routine 4600. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 46 can be implemented in a variety of orders, or can be performed concurrently. For example, the log anomaly detection and the pipeline metric outlier detection can occur sequentially in any order, in parallel, and/or overlapping in time.

FIG. 47 is a flow diagram illustrative of an embodiment of a routine 4700 implemented by the streaming data processor 308 to detect an anomalous metric. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4700 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4702, a multi-variate time-series outlier detection is performed on a set of metrics corresponding to a first time to determine an outlier score. For example, the multi-variate time-series outlier detection may indicate a distance between the set of metrics corresponding to the first time and a closest metric cluster (e.g., a centroid of a closest metric cluster). The pipeline metric outlier detector(s) 3408 can set the outlier score for the set of metrics corresponding to the first time to be this distance.

At block 4704, a data structure corresponding to a first log is parsed to match with a pattern. For example, the pattern matcher(s) 3404 can identify the length of the string vector (e.g., identify the number of elements or tokens that comprise the string vector) and identify zero or more data patterns that have the same length as the string vector. The pattern matcher(s) 3404 can then compare the string vector to just those data patterns having the same length. The comparison can include identifying whether the first token of the string vector matches the first token of a data pattern, whether the second token of the string vector matches the second token of a data pattern, and so on. The pattern matcher(s) 3404 can match the data structure (e.g., string vector) to the pattern based on a determination that the string vector is closest to the pattern.

At block 4706, a determination is made that the first log corresponding to the first time is anomalous based on the pattern. For example, the first log may be anomalous because a token value of the string vector corresponding to the first log is below or above a certain percentile or because a number of string vectors assigned to the pattern is low.

At block 4708, an anomaly score corresponding to the first log is combined with the outlier score to form a combined score. For example, the anomaly score may be a distance between the string vector corresponding to the first log and a closest pattern. The anomaly score and the outlier score can be combined using a weighted sum to form the combined score.

At block 4710, a determination is made that the combined score satisfies a threshold. For example, the combined score may exceed a threshold.

At block 4712, an alert is generated indicating that at least one of the metrics in the set is anomalous because of an anomaly corresponding to the first log. For example, the combined score satisfying the threshold may cause the anomalous metric identifier 3410 to conclude that at least one of the metrics in the set being an outlier is not a false positive.

Fewer, more, or different blocks can be used as part of the routine 4700. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 47 can be implemented in a variety of orders, or can be performed concurrently. For example, the log anomaly detection and the metric outlier detection can occur sequentially in any order, in parallel, and/or overlapping in time.

FIG. 48 is a flow diagram illustrative of an embodiment of a routine 4800 implemented by the streaming data processor 308 to assign a set of metrics to a metric cluster in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4800 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4802, a set of metrics corresponding to a first time is compared to a set of metric clusters. For example, the pipeline metric outlier detector(s) 3408 can determine a distance between each of the metric clusters in the set and the set of metrics.

At block 4804, the set of metrics corresponding to the first time is assigned to a new metric cluster based on a distance between the set of metrics and each metric cluster in the set being greater than a minimum cluster distance. For example, the minimum cluster distance may be the minimum distance between any two metric clusters in the set. The distance between the set of metrics and each metric cluster may be a distance between the set of metrics and a centroid of each metric cluster.

At block 4806, the minimum cluster distance is updated based on the creation of the new metric cluster. For example, the pipeline metric outlier detector(s) 3408 can determine whether the distance between the new metric cluster and any of the existing metric clusters is less than the minimum cluster distance. If none of the distances between the new metric cluster and the existing metric clusters is less than the minimum cluster distance, then the pipeline metric outlier detector(s) 3408 may keep the minimum cluster distance as the same value. However, if at least one of the distances between the new metric cluster and the existing metric clusters is less than the minimum cluster distance, then the minimum cluster distance may be updated by the pipeline metric outlier detector(s) 3408 to be the lowest of the distances less than the previous minimum cluster distance.

At block 4808, an outlier score of the set of metrics is set to be a distance between the set of metrics and the new metric cluster. Given that the set of metrics may be at the same location as the new metric cluster (at least until additional metrics are assigned to the new metric cluster), the outlier score may be 0.

Fewer, more, or different blocks can be used as part of the routine 4800. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 48 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 49 is another flow diagram illustrative of an embodiment of a routine 4900 implemented by the streaming data processor 308 to assign a set of metrics to a metric cluster in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 4900 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 4902, a set of metrics corresponding to a first time is compared to a set of metric clusters. For example, the pipeline metric outlier detector(s) 3408 can determine a distance between each of the metric clusters in the set and the set of metrics.

At block 4904, the set of metrics corresponding to the first time is assigned to a first metric cluster in the set based on a distance between the set of metrics and the first metric cluster being less than a minimum cluster distance. For example, the minimum cluster distance may be the minimum distance between any two metric clusters in the set. The distance between the set of metrics and the first metric cluster may be a distance between the set of metrics and a centroid of the first metric cluster.

At block 4906, a weight and cluster location of the first metric cluster are updated based on an assignment of the set of metrics to the first metric cluster. For example, the weight may represent a count of a number of metric groups assigned to the first metric cluster. Thus, the weight may be incremented by the pipeline metric outlier detector(s) 3408 by 1. The cluster location may be updated by the pipeline metric outlier detector(s) 3408 to take into account the location of the set of metrics. Thus, locations of all the metric groups assigned to the first metric cluster, including the newly assigned set of metrics, can be averaged by the pipeline metric outlier detector(s) 3408 to determine the updated cluster location of the first metric cluster.

At block 4908, the minimum cluster distance is updated based on the updated cluster location of the first metric cluster. For example, the updated cluster location of the first metric cluster may mean that the minimum cluster distance has changed. Thus, the pipeline metric outlier detector(s) 3408 can determine whether the distance between the moved first metric cluster and the other metric clusters in the set is less than the minimum cluster distance. If the minimum cluster distance was not between the first metric cluster and another metric cluster in the set and none of the distances between the moved first metric cluster and the other metric clusters in the set is less than the minimum cluster distance, then the pipeline metric outlier detector(s) 3408 may keep the minimum cluster distance as the same value. If the minimum cluster distance was between the first metric cluster and another metric cluster in the set, then the pipeline metric outlier detector(s) 3408 may recalculate some or all of the distances between the metric clusters in the set to determine a new minimum cluster distance. However, if at least one of the distances between the first metric cluster and the other metric clusters in the set is less than the minimum cluster distance, then the minimum cluster distance may be updated by the pipeline metric outlier detector(s) 3408 to be the lowest of the distances less than the previous minimum cluster distance.

At block 4910, an outlier score of the set of metrics is set to be a distance between the set of metrics and the first metric cluster. For example, the outlier score may be the distance between the set of metrics and a centroid of the moved first metric cluster.
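As a non-limiting illustration, routines 4800 and 4900 can be sketched together as a single streaming assignment step. The class name `StreamingClusters`, the Euclidean distance, and the `initial_min_dist` seed used before two clusters exist are assumptions made for this example. For brevity, the sketch recomputes the minimum cluster distance over all pairs after each change rather than applying the incremental checks described above.

```python
import math

class StreamingClusters:
    def __init__(self, initial_min_dist=1.0):
        self.centroids = []   # cluster locations
        self.weights = []     # number of metric sets assigned to each cluster
        self.min_dist = initial_min_dist  # assumed seed until two clusters exist

    def _refresh_min_dist(self):
        pairs = [math.dist(a, b) for i, a in enumerate(self.centroids)
                 for b in self.centroids[i + 1:]]
        if pairs:
            self.min_dist = min(pairs)

    def add(self, metrics):
        """Assign a set of metrics to a cluster and return its outlier score."""
        point = [float(x) for x in metrics]
        dists = [math.dist(point, c) for c in self.centroids]
        nearest = min(range(len(dists)), key=dists.__getitem__, default=None)
        if nearest is None or dists[nearest] > self.min_dist:
            # Farther than the minimum cluster distance: create a new cluster.
            self.centroids.append(point)
            self.weights.append(1)
            self._refresh_min_dist()
            return 0.0
        # Otherwise update the weight and location of the nearest cluster.
        w, c = self.weights[nearest], self.centroids[nearest]
        self.centroids[nearest] = [(w * ci + xi) / (w + 1)
                                   for ci, xi in zip(c, point)]
        self.weights[nearest] = w + 1
        self._refresh_min_dist()
        return math.dist(point, self.centroids[nearest])
```

In this sketch, a set of metrics that starts a new cluster receives an outlier score of 0, and a set of metrics assigned to an existing cluster receives the distance to the moved centroid.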

Fewer, more, or different blocks can be used as part of the routine 4900. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 49 can be implemented in a variety of orders, or can be performed concurrently.

FIG. 50 is another flow diagram illustrative of an embodiment of a routine 5000 implemented by the streaming data processor 308 to merge metric clusters in real-time. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 5000 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 5002, a determination is made that a number of created metric clusters exceeds a threshold. For example, the threshold may be on the order of k log₁₀ n.

At block 5004, one or more metric clusters are merged to form a smaller set of metric clusters. For example, each metric cluster may be treated as a point to cluster, and a clustering algorithm (e.g., k-means, k-means++, etc.) can be applied to the metric clusters to merge the metric clusters into a smaller set of metric clusters. The pipeline metric outlier detector(s) 3408 may perform a hierarchical merge such that one or more complete metric clusters are merged together.

At block 5006, a minimum cluster distance is updated based on the smaller set of metric clusters. For example, the smaller set of metric clusters may mean that the previous minimum cluster distance is no longer valid. Thus, the pipeline metric outlier detector(s) 3408 can determine the distances between each of the metric clusters in the smaller set to determine the new minimum cluster distance.

Fewer, more, or different blocks can be used as part of the routine 5000. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 50 can be implemented in a variety of orders, or can be performed concurrently.

4.16. Online Machine Learning

Generally, machine learning models are trained and deployed using batch algorithms. A batch algorithm may have access to all of the training data at one time, and use the training data to train a machine learning model. Training and deploying machine learning models using batch algorithms, however, may be difficult, time-intensive, and resource-intensive. For example, many batch algorithms are slow to converge. Even if a batch algorithm converges quickly, such a batch algorithm often uses too many computing resources (e.g., processing power, memory usage, network or bus bandwidth, etc.) to perform the convergence. In addition, the quality of a machine learning model may be a function of how often the machine learning model is trained and re-trained, not necessarily a function of how good the batch algorithm is that is used to train the machine learning model. To train a machine learning model properly, a user may be required to have domain expertise (e.g., knowledge of what features in raw machine data are important and unimportant to the training process), time to parse raw machine data and identify appropriate features in the raw machine data that can be used to train the machine learning model, and expertise in how to perform the steps to actually train a machine learning model. Even assuming the user has the right expertise to identify appropriate features in the raw machine data and complete the training process, a user may expend a large amount of effort to identify appropriate features in the raw machine data and a large amount of computing resources may be expended to train the machine learning model given the high volume of raw machine data that may be available.

Because of the effort expended to train a machine learning model once using a batch algorithm, a user may refrain from re-training the trained machine learning model, thereby sacrificing model accuracy for convenience. In fact, even if the user attempted to re-train the trained machine learning model one or more times, the re-training process may take a long period of time because of a lack of knowledge on whether the re-trained machine learning model is more accurate than the originally trained machine learning model. The user may also lack the ability to know when to re-train the trained machine learning model or how often to perform the re-training. If the user re-trains the trained machine learning model too often, the computing resources used to perform the re-training may be overused with little improvement in model accuracy. Conversely, if the user does not re-train the trained machine learning model often enough, then the resulting trained machine learning model may be inaccurate and perform poorly.

Finally, deploying a machine learning model trained by a batch machine learning algorithm in a manner that reduces model inaccuracies is difficult and may require a user to have deployment expertise (e.g., knowledge in how to deploy batch machine learning algorithms into an active environment, such as an environment in which data is ingested, processed, and stored for later consumption, rather than into a test environment). For example, batch machine learning algorithms are often written in one language optimized for training during a test or training phase (e.g., Python, TensorFlow, etc.), but are written in another language optimized for production during a deployment phase (e.g., Java). Because of the difference in the languages, a user may have to rewrite some of the batch machine learning algorithm logic when it comes time to deploy the batch machine learning algorithm into an active environment for the purpose of training a machine learning model. Thus, the batch machine learning algorithm may act differently during the test or training phase than during the deployment phase. To address this issue, users generally write the batch machine learning algorithm using the training-optimized language in a manner that restricts the types of transformations that are performed to just those transformations that can be easily converted into the production-optimized language. Artificially restricting the types of transformations that are performed, however, reduces the accuracy of machine learning models trained using the batch machine learning algorithm. Other users may address this issue by running the training-optimized language during the deployment phase. However, the training-optimized language is not optimized for low latency, high throughput, and/or other metrics that are important for producing timely outputs during the deployment phase. Thus, these users may be forced to use additional computing resources to run the training-optimized language during the deployment phase and/or may run machine learning algorithms with high latency, low throughput, and/or the like. As a result, users can either run batch machine learning algorithms that produce inaccurate machine learning models or run batch machine learning algorithms that perform slowly during deployment. In the context of the data processing pipeline described herein, it may be unacceptable to use inaccurate machine learning models or to run slow batch machine learning algorithms written in different languages, as doing so may make it difficult to produce a replicable data processing pipeline that uses machine learning, at least in part, to process data.

Not only is training and deploying machine learning models using batch algorithms difficult, time-intensive, and resource-intensive, but available computing resources can also limit the accuracy of machine learning models trained using batch algorithms. For example, a user may obtain a large amount of raw machine data. However, the amount of computing resources available to process the raw machine data may be limited, and therefore the computing resources may not be capable of processing all of the raw machine data to train a machine learning model. As a result, a user may sample the raw machine data and train the machine learning model on the sampled data. However, by sampling the raw machine data, the user may be skipping raw machine data that may be helpful in training a more-accurate machine learning model. Alternatively, a user may use a complex machine learning algorithm to train a machine learning model in an attempt to improve accuracy, but perform the training using only a few features present in the raw machine data given the computing resources limitation. However, the scope of the types of outputs produced by the trained machine learning model may be limited given that the user has restricted the types of features that are used in the training. Thus, limitations in the availability of computing resources can result in a batch algorithm being used to train a machine learning model without all of the available raw machine data being leveraged to perform the training. It may be acceptable to train a machine learning model using some, but not all, of the available raw machine data, but a batch algorithm provides no mechanism for indicating or automatically obtaining relevant raw machine data (and/or discarding irrelevant raw machine data) for use in training a machine learning model when computing resources are limited.

Accordingly, described herein are various applications of an online machine learning algorithm that can be used to train more-accurate machine learning models in a manner that is less difficult, time-intensive, and resource-intensive. For example, the online machine learning algorithm may not operate like a batch algorithm. Rather than having access to all of the training data at one time to train a machine learning model, the online machine learning algorithm can learn in real-time as individual training data elements are obtained. Specifically, the online machine learning algorithm can obtain an individual training data element, optionally train or re-train a machine learning model using the individual training data element, obtain the next individual training data element, optionally train or re-train the machine learning model using this next individual training data element, and so on. In other words, the online machine learning algorithm can use a previous learning to score the most-recently obtained training data element and optionally update the learning, even without having access to all of the training data at one time.
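The score-then-update loop described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular embodiment: it assumes the "learning" is a running mean and variance maintained with Welford's method, and the names `RunningStats` and `process` are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class RunningStats:
    """Hypothetical online model: running mean/variance (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def score(self, x):
        """Score x against what was learned before x arrived (a z-score)."""
        if self.n < 2:
            return 0.0  # not enough history to score yet
        std = sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else abs(x - self.mean) / std

    def update(self, x):
        """Fold one element into the learning without storing the element."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def process(stream):
    """Score each element with the previous learning, then update it."""
    model = RunningStats()
    for x in stream:
        s = model.score(x)
        model.update(x)
        yield x, s


results = list(process([1.0, 2.0, 3.0, 100.0]))
```

Only three numbers are retained regardless of how many elements have been seen, which is the property that lets the loop run over an unbounded stream.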

Because the online machine learning algorithm processes a smaller volume of data at any given time and processes the data as the data is obtained, the online machine learning algorithm may converge faster than a batch algorithm (and therefore can be applied to low latency applications), may use fewer computing resources than a batch algorithm, can train a machine learning model using any volume of training data, and can be used to train any number of machine learning models (e.g., the online machine learning algorithms may be unbounded in cardinality). The online machine learning algorithm can determine, automatically without user intervention, when a machine learning model should be re-trained and perform the re-training, thereby producing machine learning models that are more accurate than those produced by batch algorithms. Accuracy of the machine learning models produced by the online machine learning algorithm is further improved by the fact that hyperparameters chosen to perform the training are not fixed or based on a static training dataset given that learning occurs in real-time. Rather, the hyperparameters chosen to perform the training can self-adjust as new training data elements are obtained.

The online machine learning algorithm may further be structured such that a machine learning model state is separated from the code of the online machine learning algorithm. Typically, a batch algorithm is structured such that the machine learning model state is embedded within the code of the batch algorithm. If the batch algorithm is ever changed (e.g., upgraded), then a new machine learning model is trained using the changed batch algorithm and the training data originally used to train the original machine learning model. Training the new machine learning model may cause data processing operations that use the machine learning model to pause or stop until the training is complete. By separating the machine learning model state from the online machine learning algorithm code, however, the online machine learning algorithm code can be swapped or upgraded without requiring a new machine learning model be trained using the upgraded machine learning algorithm code and all of the previously seen training data when the swap or upgrade occurs and/or without pausing or stopping data processing operations that include use of a machine learning model trained by the original online machine learning algorithm code. Rather, the swapped or upgraded machine learning algorithm code can obtain the latest version of the machine learning model trained by the original online machine learning algorithm code, and start re-training this latest version using new training data elements as the new training data elements are obtained. Thus, the online machine learning algorithms can be swapped or upgraded without using additional computing resources to redo previously-completed training and without delaying data processing operations.
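One way to picture the separation of model state from algorithm code is shown below. This is a sketch under stated assumptions: the state is a plain dictionary serialized as JSON, and `algo_v1` and `algo_v2` are hypothetical stand-ins for an original and an upgraded online algorithm that share the same state schema.

```python
import json


def algo_v1(state, x):
    """Hypothetical original algorithm: plain running mean."""
    n = state["n"] + 1
    mean = state["mean"] + (x - state["mean"]) / n
    return {"n": n, "mean": mean}


def algo_v2(state, x, alpha=0.5):
    """Hypothetical upgraded algorithm: exponentially weighted mean."""
    n = state["n"] + 1
    if state["n"] == 0:
        mean = x
    else:
        mean = (1 - alpha) * state["mean"] + alpha * x
    return {"n": n, "mean": mean}


# Train with the original code; the state lives outside the algorithm.
state = {"n": 0, "mean": 0.0}
for x in [2.0, 4.0]:
    state = algo_v1(state, x)
snapshot = json.dumps(state)  # latest model version, stored independently

# Swap the code: the upgraded algorithm resumes from the latest state
# instead of re-training on all previously seen elements.
state = json.loads(snapshot)
state = algo_v2(state, 10.0)
```

Because both functions take the state as an argument and return a new state, replacing one with the other requires no replay of earlier training data.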

Various applications of an online machine learning algorithm are described below, including for adaptive thresholding, sequential outlier detection, sentiment analysis, and drift detection in a data processing pipeline. However, these applications are not meant to be limiting. The characteristics and features of the online machine learning algorithm described herein can be applied to any other application that processes raw machine data in real-time, such as a stream of raw machine data that is obtained and transformed by one or more components in a data processing pipeline.

To implement the online machine learning described herein, the streaming data processor 308 can run various tasks, including an adaptive thresholder 6002, a sequential outlier detector 6004, a sentiment analyzer 6006, a drift detector 6008, an anomaly explainer 6010, and a machine learning algorithm swapper 6012, as shown in FIG. 60. Any of these tasks, alone or in combination, can be applied to data passing through a pipeline, e.g., added to a data processing pipeline, though not all tasks may be useful to all sets of data. The adaptive thresholder 6002 can detect, in real-time, whether an obtained raw machine data element is an outlier as the raw machine data element is obtained, where the determination may be based on the values of the N most-recently obtained raw machine data elements. The adaptive thresholder 6002 can determine whether an obtained raw machine data element is an outlier using information derived from the N most-recently obtained raw machine data elements without having to store these N most-recently obtained raw machine data elements.

The sequential outlier detector 6004 can detect, in real-time, whether a sequence of events included in obtained raw machine data is anomalous as the raw machine data is obtained. The sentiment analyzer 6006 can determine, in real-time, whether obtained raw machine data (e.g., text, such as messages, item reviews, social media postings, etc.) includes a positive sentiment or a negative sentiment as the raw machine data is obtained. The sentiment analyzer 6006 may use ratings or other labels (e.g., thumbs up, thumbs down, etc.) included in the obtained raw machine data to train an online machine learning model to detect positive or negative sentiment. The sentiment analyzer 6006 can then use the trained online machine learning model to output an indication of the sentiment of obtained raw machine data and/or assign the raw machine data a rating or label when the raw machine data does not include any rating or label. The drift detector 6008 can detect, in real-time, whether an obtained raw machine data element marks a change in a distribution of a time-series as the raw machine data element is obtained. For example, a time-series may have one or more shifts in the pattern or trend of values, and the drift detector 6008 can detect the raw machine data elements that represent the beginning of these shifts in real-time.

As described herein, the streaming data processor 308 (e.g., the anomaly detector 3406, the pipeline metric outlier detector 3408, etc.) can detect anomalous events or other fields. The anomaly explainer 6010 can, in real-time, identify correlations between anomalous token values, data patterns, and/or pipeline metrics and other token values, data patterns, and/or pipeline metrics that might explain why the anomaly occurred. In some implementations, the anomaly explainer 6010 implements the functionality of the anomalous metric identifier 3410 described herein alternatively to or in addition to the functionality described herein with respect to the anomaly explainer 6010.

The machine learning algorithm swapper 6012 can perform A/B testing to test one or more machine learning algorithms while another machine learning algorithm is implemented in a data processing pipeline to process raw machine data for storage, and can determine whether one machine learning algorithm being tested is performing better than the machine learning algorithm implemented in the data processing pipeline to process raw machine data for storage. If the machine learning algorithm swapper 6012 determines that one machine learning algorithm being tested is performing better than the machine learning algorithm implemented in the data processing pipeline to process raw machine data for storage, then the machine learning algorithm swapper 6012 can, without any downtime in the data processing pipeline, swap the code of the machine learning algorithm implemented in the data processing pipeline to process raw machine data for storage with the code of the machine learning algorithm being tested that has better performance.
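The A/B comparison and swap can be sketched as below. The description does not state how "performing better" is measured, so this sketch assumes cumulative absolute error against a known actual value; the class `Swapper`, its methods, and that metric are all hypothetical choices made for illustration.

```python
class Swapper:
    """Hypothetical A/B harness: one live algorithm, one under test."""

    def __init__(self, live, candidate):
        self.live = live
        self.candidate = candidate
        self.live_err = 0.0
        self.cand_err = 0.0

    def process(self, x, actual):
        """Run both algorithms; only the live output goes downstream."""
        out = self.live(x)
        self.live_err += abs(out - actual)
        self.cand_err += abs(self.candidate(x) - actual)
        return out

    def maybe_swap(self):
        """Promote the candidate if its accumulated error is lower.

        Reassigning a reference between calls to process() means no
        element is dropped or delayed by the swap.
        """
        if self.cand_err < self.live_err:
            self.live, self.candidate = self.candidate, self.live
            self.live_err, self.cand_err = self.cand_err, self.live_err
            return True
        return False
```

For example, with a live algorithm that always outputs 0.0 and a candidate that echoes its input, feeding a few elements whose actual value equals the input leaves the candidate with zero error, so `maybe_swap()` promotes it and later calls to `process` return the candidate's output.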

Additional details of the adaptive thresholder 6002, the sequential outlier detector 6004, the sentiment analyzer 6006, the drift detector 6008, the anomaly explainer 6010, and the machine learning algorithm swapper 6012 are provided below.

FIG. 61 is a flow diagram illustrative of an embodiment of a routine 6100 implemented by the streaming data processor 308 to implement an online machine learning model. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 6100 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the streaming data processor 308. Thus, the following illustrative embodiment should not be construed as limiting.

At block 6102, a stream of raw machine data is obtained for processing by a data processing pipeline. For example, the stream of raw machine data may be ingested into the intake system 210 for processing and storage. Individual raw machine data in the stream may be ingested in sequence, in parallel, and/or any combination thereof.

At block 6104, a prediction is generated for each raw machine data in the stream using a machine learning model that is a component in the data processing pipeline. For example, each raw machine data may be transformed one or more times by various components in the data processing pipeline, with the machine learning model being one component in the data processing pipeline that performs a transformation. The prediction may indicate a property of the respective raw machine data, such as whether the respective raw machine data is an outlier, corresponds to an anomalous sequence, has a positive or negative sentiment, marks a change in a distribution of a time-series, and/or the like.

At block 6106, for each raw machine data in the stream, the machine learning model is evolved (e.g., updated, trained, re-trained, etc.) in response to the respective raw machine data satisfying a condition. For example, the condition may be that the respective raw machine data is associated with a time that falls within a time window, that a sequence of events associated with the respective raw machine data is more than a minimum distance from each data pattern in a set of data patterns, that the respective raw machine data lacks a rating or label, that the respective raw machine data is associated with a time that makes the respective raw machine data one of the N most-recent raw machine data elements, and/or the like.

At block 6108, an output is generated based on at least some of the generated predictions. For example, the output may be an indication of those raw machine data in the stream that are outliers, an indication of those raw machine data in the stream that correspond to an anomalous sequence, the detected sentiment of some or all of the raw machine data in the stream, an indication of those raw machine data in the stream that mark a change in a distribution of a time-series, and/or the like.

At block 6110, the output is provided to another component in the data processing pipeline. For example, the other component may perform one or more additional transformations on the output, may store the output, may discard the output, and/or the like.

Fewer, more, or different blocks can be used as part of the routine 6100. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 61 can be implemented in a variety of orders, or can be performed concurrently. For example, the generation of the prediction and the evolving of the machine learning model can occur sequentially in any order, in parallel, and/or overlapping in time.

4.16.1. Adaptive Thresholding

Adaptive thresholding can be used to compute anomalies or outliers in values falling within a time window, such as in values falling within the last N seconds, minutes, days, weeks, months, etc., with the adaptive threshold computation being repeated periodically (e.g., every second, minute, day, week, month, etc.). For example, FIG. 62 illustrates a graph 6200 depicting various values generated over time. Adaptive thresholding can be used to identify an anomalous value, taking into account only those values that fall within time window 6202. As illustrated in FIG. 62, value 6204 may be identified as being anomalous.

Typically, batch algorithms are used to perform adaptive thresholding. For example, the values falling within the time window 6202 may be stored and used by a batch algorithm to perform the adaptive thresholding. Given that a large volume of values may fall within the time window 6202 and that the adaptive thresholding computation may be repeated often, however, the amount of available computing resources may limit the number of different adaptive thresholding computations that can be run and/or the number of times an adaptive thresholding computation can be repeated using the batch algorithm. Moreover, given that a large volume of values may fall within the time window 6202 and that the adaptive thresholding computation may be repeated often, the amount of available computing resources may limit the number of different events or metrics upon which anomalies or outliers can be detected using the batch algorithm. In fact, the amount of available computing resources may further limit the number of values that can be stored. If a large number of values fall within the time window 6202, certain values may be omitted from the adaptive thresholding computation performed using a batch algorithm, thereby reducing the accuracy of the computation.

Implementing adaptive thresholding using an online machine learning algorithm, however, can overcome the technical deficiencies described above. In particular, an online machine learning algorithm that performs adaptive thresholding may not be as limited by the amount of available computing resources given the design of the algorithm, allowing many different adaptive thresholding computations to be performed and repeated any number of times and/or allowing adaptive thresholding to be performed on any number of events or metrics.

It can be difficult to implement an online machine learning algorithm that performs adaptive thresholding, however. For example, because an online machine learning algorithm evaluates each new raw machine data element as the respective new raw machine data element is obtained or ingested, there may not be an opportunity to store each raw machine data element associated with a time falling within the time window 6202. Because the raw machine data elements may not be stored, it can also be difficult to properly expire raw machine data elements (e.g., disregard raw machine data elements that are associated with times that now fall outside the time window 6202) such that the adaptive thresholding computation is only being performed using raw machine data elements (or representations thereof) associated with a time falling within the time window 6202. Finally, raw machine data elements can be ingested out of order, meaning that some raw machine data elements obtained or ingested early on and relied upon as representing the oldest raw machine data elements may actually be associated with times that are more recent than the times associated with other raw machine data elements obtained or ingested more recently that may fall outside the time window 6202. With a batch algorithm, raw machine data elements being ingested out of order is not a concern because all of the raw machine data elements are known, and therefore the raw machine data elements can be sorted prior to performing the adaptive thresholding computation. Sorting may not be possible with an online machine learning algorithm given that all of the raw machine data elements associated with a time falling within the time window 6202 may not be known or stored. Ingesting raw machine data elements out of order can therefore yield poor adaptive thresholding results.

The adaptive thresholder 6002 can implement an online machine learning algorithm that performs adaptive thresholding and that is designed to overcome the technical deficiencies of typical online machine learning algorithms described above. For example, the adaptive thresholder 6002 can be a component in a data processing pipeline that performs adaptive thresholding operations, as shown in FIG. 63. As illustrated in FIG. 63, raw machine data may originate from a data stream source 6302, which may be internal or external to the data intake and query system 108. The raw machine data may be transformed by zero or more data processing components 6304 before being provided to the adaptive thresholder 6002 as an input. The adaptive thresholder 6002 can transform the provided raw machine data (e.g., by detecting whether the raw machine data or a value therein is anomalous or an outlier) and produce a corresponding output. Zero or more data processing components 6306 can transform the output produced by the adaptive thresholder 6002 before the optionally transformed output is written to an index 6308, such as the indexing system 212, and/or to any data store present in the data intake and query system 108.

The adaptive thresholder 6002 can perform adaptive thresholding using an online machine learning algorithm each time a new raw machine data element is obtained. To perform the adaptive thresholding, the adaptive thresholder 6002 can generate a quantile or Gaussian sketch for the most-recently obtained raw machine data element. A quantile or Gaussian sketch may be a downsampled version of a set of data that has similar statistics (e.g., mean, variance, etc.) as the entire set of data. The adaptive thresholder 6002 may have previously generated other quantile or Gaussian sketches, such as when previous raw machine data elements in a stream were obtained or ingested and/or when previously-generated quantile or Gaussian sketches were merged together by the adaptive thresholder 6002. Thus, the adaptive thresholder 6002 may maintain a sketch for the most-recently obtained raw machine data element and zero or more sketches that were previously generated.

Each sketch may be associated with a starting timestamp (e.g., which may be equivalent to a timestamp associated with the oldest raw machine data element represented by the sketch) and an ending timestamp (e.g., which may be equivalent to a timestamp associated with the newest raw machine data element represented by the sketch). Thus, the adaptive thresholder 6002 can analyze the starting timestamps associated with each sketch and determine whether any sketch has a starting timestamp that does not fall within the time window 6202 (where a sketch having a starting timestamp falling outside the time window 6202 indicates that the sketch includes at least one raw machine data element associated with a time falling outside the time window 6202). The adaptive thresholder 6002 can then discard those sketches having a starting timestamp that does not fall within the time window 6202. In this way, the adaptive thresholder 6002 can effectively expire raw machine data elements associated with times falling outside the time window 6202, thereby ignoring such raw machine data elements when performing the adaptive thresholding.
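The expiry rule described above can be sketched briefly. The `Sketch` class, its field names, and the choice to express the time window as a duration ending at the current time are assumptions made for illustration; a real quantile or Gaussian sketch would hold a compact summary rather than a list of values.

```python
from dataclasses import dataclass, field


@dataclass
class Sketch:
    """Hypothetical stand-in for a quantile or Gaussian sketch."""
    start: float  # timestamp of the oldest element represented
    end: float    # timestamp of the newest element represented
    values: list = field(default_factory=list)


def expire(sketches, now, window):
    """Keep only sketches whose starting timestamp is inside the window.

    A sketch whose start precedes the cutoff represents at least one
    expired element, so the whole sketch is discarded.
    """
    cutoff = now - window
    return [s for s in sketches if s.start >= cutoff]


sketches = [Sketch(0, 5, [1.0]), Sketch(6, 8, [2.0]), Sketch(9, 9, [3.0])]
kept = expire(sketches, now=10, window=4)  # cutoff is 6
```

Discarding a whole sketch can also drop in-window elements that were merged into it, which is one reason the size of merged sketches matters for accuracy.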

The adaptive thresholder 6002 may maintain the previously generated sketch(es) in a sorted order, thereby maintaining a hierarchy of previously generated sketch(es). For example, the adaptive thresholder 6002 can maintain the previously generated sketch(es) in an order based on the associated timestamps. Thus, the adaptive thresholder 6002 may maintain a first and second sketch in an order in which the second sketch follows the first sketch if the first sketch has an ending timestamp that is earlier than the starting timestamp of the second sketch. The adaptive thresholder 6002 can then place the sketch for the most-recently obtained raw machine data element in the hierarchy of previously generated sketch(es) at a position determined based on the timestamps associated with the most-recently obtained raw machine data element sketch (e.g., where the starting timestamp and the ending timestamp may both be the time associated with the most-recently obtained raw machine data element). In this way, the adaptive thresholder 6002 can maintain a sorted order of sketches despite not having access to all of the underlying raw machine data elements at one time, thereby avoiding the out-of-order ingestion issue described above.
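Placing a new single-element sketch into the ordered hierarchy can be sketched with a binary search. This illustration assumes the hierarchy is a Python list ordered from least recent to most recent and that position is decided by starting timestamp alone; the `Sketch` tuple and the `place` function are hypothetical names.

```python
import bisect
from typing import NamedTuple


class Sketch(NamedTuple):
    """Hypothetical sketch record; start == end for a single element."""
    start: float
    end: float
    value: float


def place(hierarchy, new):
    """Insert `new` so the list stays ordered by starting timestamp.

    Returns the index at which the sketch was placed.
    """
    starts = [s.start for s in hierarchy]
    idx = bisect.bisect_right(starts, new.start)
    hierarchy.insert(idx, new)
    return idx


hierarchy = [Sketch(1, 3, 0.4), Sketch(4, 6, 0.7), Sketch(9, 9, 0.5)]
# An element stamped t=7 arrives after the element stamped t=9.
late = Sketch(7, 7, 0.6)
position = place(hierarchy, late)
```

The late element lands between the sketches covering t=4..6 and t=9, so ordering is preserved without access to the underlying raw elements.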

Once the adaptive thresholder 6002 has placed the sketch in the hierarchy of previously generated sketch(es), the adaptive thresholder 6002 can iterate through pairs of sketch(es) in the hierarchy, from most recent to least recent, to determine whether each respective pair of sketches should be merged together. For example, the adaptive thresholder 6002 can determine a merge condition derived from a relationship between the sketch sizes before merging and the desired error epsilon after merging. In particular, the adaptive thresholder 6002 can temporarily merge a pair of sketches and determine whether the error (e.g., error in a statistical metric, such as a difference between the statistical metric of the merged pair of sketches and the statistical metric of an individual sketch or a group of sketches) resulting from the merged pair of sketches is within a threshold (e.g., 1+epsilon) of the error before merging some or all of the sketches in the hierarchy (e.g., all of the sketches already analyzed for the purposes of merging). If the error of the merged pair of sketches is less than this bound (e.g., less than the threshold), then the adaptive thresholder 6002 can officially merge the pair of sketches and move on to the next pair of sketches (e.g., the next oldest sketch and the newly merged sketch, the two next oldest sketches, etc.).

Once the adaptive thresholder 6002 has iterated through all of the sketches in the hierarchy to determine whether merging should occur, the adaptive thresholder 6002 can iterate through each of the remaining sketches in the hierarchy and determine, for the respective sketch, a value of a lower quantile (e.g., the 25% quantile) and a value of an upper quantile (e.g., the 75% quantile). The adaptive thresholder 6002 can determine the lower and upper quantile values based on the values of the raw machine data elements included in the respective sketch. As an example, the adaptive thresholder 6002 can analyze the values of the raw machine data elements included in the respective sketch and determine which of the values represents a 25% quantile of values and which of the values represents a 75% quantile of values. The adaptive thresholder 6002 can then aggregate each of the determined lower quantile values and each of the determined upper quantile values (e.g., average the determined lower quantile values and average the determined upper quantile values) to determine an aggregated lower quantile value and an aggregated upper quantile value.

The adaptive thresholder 6002 can use the aggregated lower quantile value and the aggregated upper quantile value to determine whether the value of the most-recently obtained raw machine data element is anomalous or an outlier. For example, the adaptive thresholder 6002 can determine whether a value in the most-recently obtained raw machine data element falls below the aggregated lower quantile value or falls above the aggregated upper quantile value. If either scenario is true, then the adaptive thresholder 6002 can determine that the value in the most-recently obtained raw machine data element is anomalous or an outlier. The adaptive thresholder 6002 can repeat these operations each time a new raw machine data element is obtained or ingested.
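The per-sketch quantile computation, the averaging, and the final comparison can be sketched together. This illustration assumes each sketch is simply a list of retained values and uses a simple index-based quantile estimate; the functions `quantile` and `is_outlier` are hypothetical, and a production sketch structure would estimate quantiles from its own summary instead.

```python
from statistics import mean


def quantile(values, q):
    """Index-based quantile estimate over a sorted copy of the values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def is_outlier(value, sketches, lower_q=0.25, upper_q=0.75):
    """Compare a value with the averaged per-sketch quantile bounds."""
    lower = mean(quantile(s, lower_q) for s in sketches)
    upper = mean(quantile(s, upper_q) for s in sketches)
    return value < lower or value > upper


# Two hypothetical sketches remaining in the hierarchy after merging.
remaining = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Per-sketch lower quantiles are 2 and 6 (average 4);
# per-sketch upper quantiles are 4 and 8 (average 6).
```

With these example bounds, a new value of 5 is within range, while 3 and 9 would each be flagged.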

The adaptive thresholder 6002 can store the generated sketches and/or the hierarchy of sketches. Alternatively, a data store in the streaming data processor 308, not shown, may store the generated sketches and/or the hierarchy of sketches, and the adaptive thresholder 6002 can retrieve the generated sketches and/or hierarchy information from the data store.

FIG. 64 is a flow diagram illustrative of an embodiment of a routine 6400 implemented by the streaming data processor 308 to perform adaptive thresholding. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 6400 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the adaptive thresholder 6002. Thus, the following illustrative embodiment should not be construed as limiting.

At block 6402, variable i is set to 1. Variable i may represent a particular raw machine data element in a stream of raw machine data.

At block 6404, any quantile sketches that are associated with expired raw machine data may be discarded. For example, any quantile sketches that have a starting timestamp that occurs outside of a time window in which adaptive thresholding is to be performed may be discarded.

At block 6406, a quantile sketch is generated for raw machine data i. For example, raw machine data i may be the most-recently obtained or ingested raw machine data element. The quantile sketch may be a Gaussian sketch and may include a value in raw machine data i.

Alternatively, block 6406 may be performed prior to block 6404. Thus, a quantile sketch for the most-recently obtained or ingested raw machine data element may be generated before any quantile sketches are discarded.

At block 6408, the generated quantile sketch is placed in a list of generated quantile sketches. For example, the list of generated quantile sketches may be an ordered list or hierarchy of previously generated quantile sketches, where such quantile sketches may be derived from previously obtained or ingested raw machine data elements and/or the merging of sketches, and in which the list or hierarchy may be ordered chronologically from least recent to most recent. The generated quantile sketch may be placed in an appropriate position in the list that is determined based on the timestamps associated with the generated quantile sketch and the timestamps associated with the quantile sketches in the list.
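As a non-limiting illustration, blocks 6404 through 6408 can be sketched as follows. The `Sketch` class, its `start` timestamp field, and the functions `discard_expired` and `insert_sketch` are hypothetical names introduced for this example. The sketch assumes each quantile sketch carries a single numeric starting timestamp and that the list is kept ordered from least recent to most recent.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Sketch:
    start: float                                   # starting timestamp (assumed numeric)
    values: list = field(default_factory=list, compare=False)


def discard_expired(sketches, now, window_seconds):
    """Block 6404: drop sketches whose starting timestamp is outside the window."""
    cutoff = now - window_seconds
    return [s for s in sketches if s.start >= cutoff]


def insert_sketch(sketches, sketch):
    """Block 6408: place the new sketch at its chronological position."""
    bisect.insort(sketches, sketch)
```

Because the list stays sorted by starting timestamp, a new sketch can be placed with a binary search rather than by re-sorting the entire list.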

At block 6410, variable k is set to be equal to a number of quantile sketches in the list. Variable k may represent a particular quantile sketch in the list or hierarchy of quantile sketches.

At block 6412, a determination is made as to whether the variable k is greater than 1. If the variable k is greater than 1, this indicates that there are additional quantile sketches that the adaptive thresholder 6002 should still evaluate for merging purposes and the routine 6400 proceeds to block 6414. Otherwise, if the variable k is less than or equal to 1, this indicates that the adaptive thresholder 6002 has evaluated all of the quantile sketches for merging purposes and the routine 6400 proceeds to block 6420.

At block 6414, a determination is made as to whether quantile sketch k should be merged with quantile sketch k−1. For example, the adaptive thresholder 6002 can temporarily merge quantile sketches k and k−1, and determine whether the size of the merged quantile sketches k and k−1 is greater than a size of a combination of the quantile sketches previously analyzed for merging purposes (e.g., the more recent quantile sketches). If the size of the merged quantile sketches k and k−1 is greater than the size of the combination of the quantile sketches previously analyzed for merging purposes, then the routine 6400 proceeds to block 6416 to officially merge the quantile sketches k and k−1. Otherwise, if the size of the merged quantile sketches k and k−1 is not greater than the size of the combination of the quantile sketches previously analyzed for merging purposes, then the routine 6400 proceeds to block 6418 such that quantile sketches k and k−1 are not merged.

At block 6416, quantile sketch k and quantile sketch k−1 are merged. Merging two quantile sketches may include combining at least some of the raw machine data elements included in one quantile sketch with at least some of the raw machine data elements included in the other quantile sketch.

At block 6418, the variable k is decremented by 1. Decrementing the variable k represents the adaptive thresholder 6002 moving on to evaluate the next newest quantile sketch(es) for merging purposes. Once the variable k is decremented, the routine 6400 reverts back to block 6412 so that the next quantile sketches can be evaluated to determine whether merging should occur.

At block 6420, variable m is set to be equal to a number of quantile sketches in the list. Variable m may represent a particular quantile sketch in the list or hierarchy of quantile sketches.

At block 6422, a lower quantile and an upper quantile are determined based on quantile sketch m. For example, the adaptive thresholder 6002 can apply a statistical operation to the values of the raw machine data elements included in the quantile sketch m to determine a value corresponding to a lower quantile of values (e.g., the 25th percentile of values) and a value corresponding to an upper quantile of values (e.g., the 75th percentile of values).

At block 6424, the variable m is decremented by 1. Decrementing the variable m represents the adaptive thresholder 6002 moving on to the next quantile sketch to determine lower and upper quantiles.

At block 6426, a determination is made as to whether the variable m is greater than 0. If the variable m is greater than 0, this may indicate that lower and upper quantiles still need to be determined for one or more quantile sketches and the routine 6400 reverts back to block 6422 so that additional lower and upper quantiles can be determined. Otherwise, if the variable m is not greater than 0, this may indicate that lower and upper quantiles have been determined for all of the quantile sketches in the list or hierarchy and the routine 6400 proceeds to block 6428.

At block 6428, an aggregated lower quantile and an aggregated upper quantile are determined using the determined lower and upper quantiles. For example, the adaptive thresholder 6002 can average the lower quantiles of each of the quantile sketches to determine the aggregated lower quantile, and can average the upper quantiles of each of the quantile sketches to determine the aggregated upper quantile.

At block 6430, a determination is made as to whether a value in raw machine data i is an outlier using the aggregated upper quantile and/or the aggregated lower quantile. For example, the adaptive thresholder 6002 may determine that the value in raw machine data i is an outlier if the value falls below the aggregated lower quantile or falls above the aggregated upper quantile.

At block 6432, the variable i is incremented by 1. Incrementing the variable i by 1 represents the adaptive thresholder 6002 obtaining the next raw machine data element in the stream. After the variable i is incremented by 1, the routine 6400 reverts back to block 6404 such that adaptive thresholding can be performed on the newly obtained raw machine data element.

Fewer, more, or different blocks can be used as part of the routine 6400. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 64 can be implemented in a variety of orders, or can be performed concurrently. For example, the quantile sketches can be merged prior to any of the quantile sketches being discarded.

4.16.2. Sequential Outlier Detection

As described herein, individual logs or events comprised within raw machine data may not include anomalous token values or be assigned to an anomalous data pattern. However, just because individual logs or events have normal values or are assigned to normal data patterns may not mean that the logs or events, as a whole, are normal. For example, the sequence in which logs or events occur may be anomalous even if the individual logs or events are normal. As an illustrative example, a trojan or other malicious process may perform operations that, individually, are normal. The sequence of operations, however, may be abnormal and lead to data being compromised, theft, malfunctions, and/or the like.

As described herein, the anomaly detector 3406 can detect anomalies in sequences of logs or events. The sequential outlier detector 6004 can also detect anomalies in sequences of logs, events, or other raw machine data, optionally implementing some or all of the functionality described above as being performed by the pattern matcher(s) 3404 and/or the anomaly detector 3406.

For example, the sequential outlier detector 6004 can be configured to determine whether a sequence of logs or events comprised within raw machine data (e.g., one or more individual raw machine data elements) matches any existing data pattern or whether the sequence should be assigned a new data pattern. The sequential outlier detector 6004 can store information for one or more data patterns. A data pattern may include one or more alphanumeric strings and zero or more wildcards separated by delimiters. Each alphanumeric string may represent a log or event that is present in each sequence assigned to the data pattern at the same position. A wildcard may indicate that the sequence(s) assigned to the data pattern include two or more different logs or events for the log or event corresponding to the position of the wildcard. As an illustrative example, a data pattern may be as follows: “<*> LOG1 LOG2 <*> LOG3 <*> <*>.” In this example, “<*>” represents a wildcard, each word or number represents a log or event, and the blank spaces between the wildcards and words represent delimiters. Thus, a sequence assigned to this data pattern may include any log or event in the first position in the sequence, “LOG1” as the log or event in the second position in the sequence, and so on. In some embodiments, a sequence may not be assigned to this data pattern if the sequence does not include “LOG1” as the log or event in the second position (unless the streaming data processor(s) 308 subsequently modifies the data pattern to replace “LOG1” with a wildcard).
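As a non-limiting illustration, the wildcard semantics of the example data pattern above can be sketched as follows. The functions `parse_pattern` and `matches` are hypothetical names introduced for this example, and the sketch assumes a sequence is represented as a list of log or event strings with a blank space as the delimiter.

```python
WILDCARD = "<*>"


def parse_pattern(text):
    """Split a delimiter-separated data pattern into its positions."""
    return text.split()


def matches(pattern, sequence):
    """True if every position is either a wildcard or an identical string."""
    if len(pattern) != len(sequence):
        return False
    return all(p == WILDCARD or p == s for p, s in zip(pattern, sequence))


pattern = parse_pattern("<*> LOG1 LOG2 <*> LOG3 <*> <*>")
```

Under this sketch, a seven-element sequence with “LOG1” in the second position, “LOG2” in the third position, and “LOG3” in the fifth position matches the pattern regardless of the logs or events in the wildcard positions.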

To determine whether a sequence matches any existing data pattern or whether the sequence should be assigned a new data pattern, the sequential outlier detector 6004 can identify existing data patterns, if any, that correspond to sequences that have the same number of logs or events as the number of logs or events comprised within the sequence. The sequential outlier detector 6004 then only compares the sequence with these existing data patterns. In this way, the sequential outlier detector 6004 can reduce the number of comparisons that are made to assign the sequence to a data pattern, thereby reducing sequential anomaly detection times and the amount of computing resources dedicated to detecting sequential anomalies in ingested data.
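As a non-limiting illustration, the length-based filtering described above can be sketched with an index keyed by pattern length. The functions `build_length_index` and `candidates_for` are hypothetical names introduced for this example. The description above does not specify how data patterns are stored, so the dictionary-based index is an assumption.

```python
from collections import defaultdict


def build_length_index(patterns):
    """Group data patterns by their number of log or event positions."""
    index = defaultdict(list)
    for pattern in patterns:
        index[len(pattern)].append(pattern)
    return index


def candidates_for(index, sequence):
    """Return only the data patterns with the same length as the sequence."""
    return index.get(len(sequence), [])
```

With such an index, a sequence of a given length is compared only against the data patterns stored under that length, rather than against every stored data pattern.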

As described above, a data pattern can be represented by a cluster having a centroid. Each log or event position of the data pattern can represent a dimension in an m-dimensional space. Thus, the location of a centroid of a cluster (e.g., the location of a center or centroid of a data pattern) in the m-dimensional space can be determined by the sequential outlier detector 6004 based on the average log or event of the sequences assigned to the data pattern. For example, the sequential outlier detector 6004 can assign numerical values to each distinct string present in a sequence assigned to the data pattern, add all of the assigned numerical values, and divide the sum by the number of sequences assigned to the data pattern to determine the first dimension value of the centroid of the data pattern. The sequential outlier detector 6004 can repeat these operations for each dimension to determine m dimension values that represent the centroid of the data pattern.
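As a non-limiting illustration, the centroid computation described above can be sketched as follows. The description does not specify how numerical values are assigned to distinct strings, so the sketch assumes each distinct string receives the next unused integer in order of first appearance. The functions `encode` and `centroid` are hypothetical names introduced for this example.

```python
def encode(sequences):
    """Map each distinct string to an integer code (assumed scheme: 1, 2, 3, ...)."""
    codes = {}
    encoded = [
        [codes.setdefault(token, len(codes) + 1) for token in sequence]
        for sequence in sequences
    ]
    return encoded, codes


def centroid(encoded):
    """Average the codes position by position to obtain m dimension values."""
    count = len(encoded)
    return [sum(column) / count for column in zip(*encoded)]


encoded, codes = encode([["LOG1", "LOG2"], ["LOG1", "LOG3"]])
center = centroid(encoded)   # one value per log or event position
```

In this example, both sequences share “LOG1” in the first position, so the first dimension value equals the code assigned to “LOG1,” while the second dimension value lies between the codes assigned to “LOG2” and “LOG3.”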

A user or the system can set a k value that represents a number of clusters (e.g., data patterns) that should be created to which sequences can be assigned. However, the sequence assignment described herein can occur even if a k value is not set by a user or system. In an embodiment, the first time a sequence of logs or events is identified (before any data patterns have been created by the sequential outlier detector 6004), the sequential outlier detector 6004 can assign the first sequence to a new data pattern that matches the first sequence. The second time a sequence is identified, the sequential outlier detector 6004 can assign the second sequence to a new data pattern as well that matches the second sequence. This process can continue for each subsequent sequence until k data patterns have been created.

At this point, the sequential outlier detector 6004 can evaluate the next sequence (e.g., the k+1 sequence to be identified) to determine whether the next sequence should be assigned to one of the k existing data patterns or whether the next sequence should be assigned to a new data pattern, and the sequential outlier detector 6004 can then assign the next sequence to the appropriate data pattern. For example, the sequential outlier detector 6004 can maintain a minimum cluster distance. The sequential outlier detector 6004 may determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between each data pattern having the same number of logs or events, and repeat this determination for each set of data patterns having the same number of logs or events. Specifically, the sequential outlier detector 6004 may determine a distance between the location of a center of a first data pattern and the location of a center of a second data pattern having the same number of logs or events as the first data pattern. For each set of data patterns having the same number of logs or events, the sequential outlier detector 6004 can determine the smallest distance between data patterns and set this distance as the minimum cluster distance for the respective set of data patterns. Thus, the sequential outlier detector 6004 may determine multiple minimum cluster distances, one for each set of data patterns having the same length (e.g., the same number of logs or events or log or event positions). The sequential outlier detector 6004 can then determine a distance (e.g., a Euclidean distance, a Cosine distance, a Jaccard distance, an edit distance, etc.) between the next sequence and each existing data pattern having the same number of logs or events as the next sequence. 
If the sequential outlier detector 6004 determines that this distance is less than or equal to the minimum cluster distance corresponding to the set of data patterns having the same number of logs or events as the next sequence, this may indicate that the next sequence is close enough to one of the existing data patterns to be assigned thereto. Thus, the sequential outlier detector 6004 can assign the next sequence to the data pattern closest (e.g., by distance) to the next sequence. Alternatively, the sequential outlier detector 6004 can compare the next sequence to the existing data patterns having the same number of logs or events to determine whether the next sequence matches any of these existing data patterns. For example, the sequential outlier detector 6004 can compare each element of the next sequence with a log or event in an existing data pattern that has the same position as the respective element (e.g., the sequential outlier detector 6004 can compare the first element with the first log or event in an existing data pattern, the second element with the second log or event in an existing data pattern, and so on), counting the number of times the element and corresponding log or event match. The sequential outlier detector 6004 can then divide the number of times the element and corresponding log or event match for a given existing data pattern by a length of the next sequence (e.g., by the number of logs or events included therein) to produce a match percentage. The sequential outlier detector 6004 can assign the next sequence to the existing data pattern that produces the highest match percentage. 
As part of the assignment, the sequential outlier detector 6004 can increase a weight of the data pattern by 1 (or any like value) to reflect that 1 additional sequence has been assigned to the data pattern (e.g., update a count of a number of sequences assigned to the data pattern to reflect that a new sequence has been assigned to the data pattern) and can adjust a centroid of the data pattern to account for the newly assigned sequence. Specifically, the sequential outlier detector 6004 can update the centroid of the data pattern by averaging the logs or events of the sequences previously assigned to the data pattern and of the next sequence to form updated m dimension values representing the centroid. Because the centroid of the data pattern has been updated, the sequential outlier detector 6004 can also recalculate the minimum cluster distance for the data pattern(s) that have the same number of logs or events as the data pattern to which the next sequence is assigned, and the recalculated minimum cluster distance can be used by the sequential outlier detector 6004 in future data pattern assignment operations.
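As a non-limiting illustration, the match-percentage alternative described above can be sketched as follows. The functions `match_percentage` and `best_pattern` are hypothetical names introduced for this example. Counting a wildcard position as a match is an assumption of this sketch, made for consistency with how wildcards are treated elsewhere herein.

```python
WILDCARD = "<*>"


def match_percentage(pattern, sequence):
    """Fraction of positions where the sequence element matches the pattern."""
    hits = sum(
        1 for p, s in zip(pattern, sequence)
        if p == s or p == WILDCARD      # assumption: a wildcard counts as a match
    )
    return hits / len(sequence)


def best_pattern(patterns, sequence):
    """Return the same-length data pattern with the highest match percentage."""
    same_length = [p for p in patterns if len(p) == len(sequence)]
    if not same_length:
        return None
    return max(same_length, key=lambda p: match_percentage(p, sequence))
```

When two data patterns produce the same match percentage, `max` returns the first one encountered; the description above does not specify a tie-breaking rule.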

However, if the sequential outlier detector 6004 determines that this distance is greater than the minimum cluster distance corresponding to the set of data patterns having the same number of logs or events as the next sequence, this may indicate that the next sequence is too far from any of the existing data patterns having the same number of logs or events as the next sequence. Thus, the sequential outlier detector 6004 can assign the next sequence to a new data pattern. Because creation of the new data pattern means that the number of data patterns having the same number of logs or events as present in the new data pattern has increased, the sequential outlier detector 6004 can calculate or recalculate the minimum cluster distance for the data pattern(s) that have the same number of logs or events as the new data pattern to which the next sequence is assigned, and the recalculated minimum cluster distance can be used by the sequential outlier detector 6004 in future data pattern assignment operations.
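As a non-limiting illustration, the distance-based assignment decision described above can be sketched as follows, using a Euclidean distance between numeric vectors. The functions `euclidean`, `minimum_cluster_distance`, and `assign` are hypothetical names introduced for this example. The sketch assumes the next sequence has already been encoded as a numeric vector of the same length as the centroids, and it treats the case of fewer than two same-length data patterns as requiring a new data pattern.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_cluster_distance(centroids):
    """Smallest pairwise distance among same-length centroids (None if fewer than two)."""
    pairs = list(combinations(centroids, 2))
    if not pairs:
        return None
    return min(euclidean(a, b) for a, b in pairs)


def assign(point, centroids):
    """Return the index of the closest centroid, or None if a new data pattern is needed."""
    threshold = minimum_cluster_distance(centroids)
    if threshold is None:
        return None                      # assumption: too few patterns to compare against
    distances = [euclidean(point, c) for c in centroids]
    closest = min(range(len(centroids)), key=distances.__getitem__)
    return closest if distances[closest] <= threshold else None
```

A return value of `None` corresponds to creating a new data pattern, after which the minimum cluster distance for that length would be recalculated as described above.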

If the sequential outlier detector 6004 assigns a sequence to an existing data pattern, the sequential outlier detector 6004 can determine whether the existing data pattern properly describes the sequence. In particular, the sequential outlier detector 6004 can determine whether any elements of the sequence do not match the corresponding logs or events of the assigned data pattern (where an element of the sequence is considered to match a log or event of the assigned data pattern if the value of the element is an alphanumeric string that matches the alphanumeric string of the log or event or if the log or event is a wildcard). If an element does not match a corresponding log or event, then the sequential outlier detector 6004 can replace the log or event with a wildcard, thereby modifying the assigned data pattern to include a wildcard in place of the alphanumeric string that was previously present. As an illustrative example, if the sequence has the value “LOG2” in the fourth element, but the fourth log or event of the assigned data pattern is “LOG1,” then the sequential outlier detector 6004 can modify the fourth log or event in the assigned data pattern to be “<*>” instead of “LOG1.” When modifying the data pattern to include a wildcard in place of an alphanumeric string, the sequential outlier detector 6004 can generate metadata associated with the data pattern identifying the specific alphanumeric values or a range of alphanumeric values represented by the wildcard. In other words, the sequential outlier detector 6004 can generate metadata to track what alphanumeric values are represented by a wildcard.
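As a non-limiting illustration, the wildcard replacement and associated metadata described above can be sketched as follows. The function `generalize` and the `wildcard_values` dictionary (mapping a position to the set of values observed at that position) are hypothetical names introduced for this example; the description above does not specify the format of the metadata.

```python
WILDCARD = "<*>"


def generalize(pattern, sequence, wildcard_values):
    """Return an updated pattern that describes the sequence; record wildcard metadata."""
    updated = list(pattern)
    for position, (p, s) in enumerate(zip(pattern, sequence)):
        if p == WILDCARD:
            # Existing wildcard: track the newly observed value.
            wildcard_values.setdefault(position, set()).add(s)
        elif p != s:
            # Mismatch: replace the string with a wildcard and record both values.
            updated[position] = WILDCARD
            wildcard_values[position] = {p, s}
    return updated


metadata = {}
pattern = generalize(["A", "B", "C", "LOG1"], ["A", "B", "C", "LOG2"], metadata)
```

After this call, the fourth position of `pattern` is a wildcard and `metadata` records that the wildcard represents “LOG1” and “LOG2.”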

If the sequential outlier detector 6004 assigns a sequence to a new data pattern, the sequential outlier detector 6004 can define the new data pattern as being the elements of the sequence. As additional pieces of ingested data are obtained and processed, the sequential outlier detector 6004 may modify this new data pattern to describe multiple sequences (e.g., the sequential outlier detector 6004 may replace some logs or events that describe the data pattern with wildcards).

The sequential outlier detector 6004 can continue these operations for subsequent sequences while the number of data patterns is greater than k and until the number of data patterns equals a threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of sequences that have been received up to that point) or until a threshold period of time has passed. Once the number of data patterns reaches the threshold or the threshold period of time has passed, the sequential outlier detector 6004 can perform a merge operation to reduce the number of data patterns. For example, the sequential outlier detector 6004 can use a clustering algorithm (e.g., k-means++), treating each data pattern as a separate point to cluster, to generate a new, smaller set of data patterns in which one or more of the existing data patterns have been merged together. For example, the clustering algorithm can take one or more passes (e.g., 1, 2, 3, etc.) on the existing data patterns to generate the new, smaller set of data patterns. Data patterns may be merged by the sequential outlier detector 6004 hierarchically, meaning that two or more data patterns can be merged together to form a single, merged data pattern and one or more sets of data patterns can be separately merged together. The sequential outlier detector 6004 can re-assign sequences that were previously assigned to the data patterns that were merged to the merged data pattern. A merged data pattern may have a definition that appropriately describes each of the sequences that were previously assigned to the data patterns that were merged to form the merged data pattern and that are now assigned to the merged data pattern. The sequential outlier detector 6004 can then continue these operations for each subsequent sequence that is identified.
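As a non-limiting illustration, the merge trigger described above can be sketched as follows. Because the threshold is described only as being on the order of k log₁₀ n, using exactly k log₁₀ n is an assumption of this sketch, as are the function names `pattern_threshold` and `should_merge`. The time-based trigger and the merge operation itself are omitted.

```python
import math


def pattern_threshold(k, n):
    """Illustrative threshold of k * log10(n), for n >= 1 sequences received so far."""
    return k * math.log10(n)


def should_merge(num_patterns, k, n):
    """Trigger a merge once the pattern count exceeds k and reaches the threshold."""
    return num_patterns > k and num_patterns >= pattern_threshold(k, n)
```

For example, with k set to 10 and 1,000 sequences received, the illustrative threshold is approximately 30 data patterns, so 12 data patterns would not trigger a merge while 35 data patterns would.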

Because the number of data patterns may be reduced after a merge operation, the sequential outlier detector 6004 can recalculate the minimum cluster distance for the data pattern(s) that have the same number of logs or events as the data pattern(s) that were merged together, and the recalculated minimum cluster distance can be used by the sequential outlier detector 6004 in future data pattern assignment operations. In some embodiments, a merge operation causes the minimum cluster distance to increase given that fewer data patterns remain. Because the sequential outlier detector 6004 creates a new data pattern when the distance between a sequence and the closest data pattern is greater than the minimum cluster distance, the increase in the minimum cluster distance from the merge operation may inherently cause the number of new data patterns being created to remain low. Thus, the number of data patterns may gravitate toward being k rather than the threshold, increasing accuracy and reducing computational costs.

Because the data to cluster is known when clustering occurs offline (e.g., not in real-time, but sometime after data has been ingested and stored, such as periodically in batches), a traditional clustering batch algorithm can run multiple passes on the data and produce exactly k (or fewer) clusters. When attempting to cluster data online or in real-time (e.g., when attempting to assign sequences to data patterns online or in real-time as the raw machine data including the logs or events are ingested), data previously received is known, but the data to be received in the future is unknown. To use a traditional clustering batch algorithm, the sequential outlier detector 6004 may have to obtain the previously identified sequences and a sequence that was just identified, and apply the traditional clustering batch algorithm to these sequences to obtain a new set of data patterns to which the sequences are assigned. The sequential outlier detector 6004 would then have to repeat these operations each time a new sequence or a new set of sequences is received. The sequential outlier detector 6004 described herein is capable of assigning sequences to data patterns in batches using a traditional clustering algorithm (e.g., k-means clustering) in a manner as described above. It may be too computationally costly, however, for the sequential outlier detector 6004 to generate new data patterns and re-assign previously identified sequences to the new data patterns each time a new sequence is identified using a traditional clustering algorithm. As each new sequence is identified, the number of sequences to assign to a data pattern would grow. Over time, the latency of the streaming data processor(s) 308 would increase, thereby incrementally increasing anomaly detection times.

The online clustering algorithm described above as being implemented by the sequential outlier detector 6004, however, can allow the sequential outlier detector 6004 to accurately assign sequences to data patterns online or in real-time without experiencing the incrementally higher delay or computational costs that would result from using a traditional clustering batch algorithm. To achieve this technical benefit, the sequential outlier detector 6004 may not necessarily create exactly k clusters or data patterns. Rather, the sequential outlier detector 6004 may maintain a number of data patterns greater than k and less than the threshold (e.g., a threshold that is on the order of k log₁₀ n, where n is the number of sequences that have been identified up to that point), with the number of data patterns generally being closer to k than to the threshold. The sequential outlier detector 6004 may maintain this number of data patterns even after a merge operation occurs. Thus, the sequential outlier detector 6004 can create data patterns, assign sequences to data patterns, and merge data patterns in real-time without being negatively affected by the drawbacks associated with using a traditional clustering batch algorithm.

After performing the assignment and/or merge operations, the sequential outlier detector 6004 can then analyze the assigned sequences, identifying those sequences assigned to a data pattern that have an occurrence among all of the sequences assigned to the data pattern less than a threshold or percentile or greater than a threshold or percentile. The sequential outlier detector 6004 can then determine that the identified sequence(s) are anomalous. Alternatively or in addition, the sequential outlier detector 6004 can analyze the logs or events of sequences assigned to a data pattern that correspond with a wildcard, and identify those logs or events that have an occurrence among all of the logs or events corresponding to the wildcard less than a threshold or percentile or greater than a threshold or percentile. The sequential outlier detector 6004 can then determine that the sequence(s) that include the identified log(s) or event(s) are anomalous. Alternatively or in addition, the sequential outlier detector 6004 can identify those sequences assigned to a data pattern having a small number (e.g., 1, 2, 3, etc.) of assigned sequences, and determine that the identified sequence(s) are anomalous.
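As a non-limiting illustration, the wildcard-based analysis described above can be sketched as follows for the case in which a log or event is rare among the values observed at a wildcard position. The functions `rare_values` and `anomalous_sequences` and the `min_share` parameter are hypothetical names introduced for this example. Only the lower threshold is shown; an upper threshold could be checked in the same manner.

```python
from collections import Counter


def rare_values(values, min_share):
    """Return the values whose share of all occurrences is below min_share."""
    counts = Counter(values)
    total = len(values)
    return {value for value, count in counts.items() if count / total < min_share}


def anomalous_sequences(sequences, position, min_share):
    """Flag sequences whose element at a wildcard position is a rare value."""
    rare = rare_values([s[position] for s in sequences], min_share)
    return [s for s in sequences if s[position] in rare]
```

For example, if nine of ten sequences assigned to a data pattern have the same log or event at the wildcard position and one sequence has a different log or event, a `min_share` of 0.2 would flag only the one differing sequence.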

In an embodiment, the sequential outlier detector 6004 can be a component in a data processing pipeline that performs sequential outlier detection, as shown in FIG. 65. As illustrated in FIG. 65, raw machine data may originate from a data stream source 6502, which may be internal or external to the data intake and query system 108. The raw machine data may be transformed by zero or more data processing components 6504 before being provided to the sequential outlier detector 6004 as an input. The sequential outlier detector 6004 can transform the provided raw machine data (e.g., by detecting whether the raw machine data corresponds to an anomalous sequence of logs or events) and produce a corresponding output. Zero or more data processing components 6506 can transform the output produced by the sequential outlier detector 6004 before the optionally transformed output is written to an index 6508, such as the indexing system 212, and/or to any data store present in the data intake and query system 108.

FIG. 66 is a flow diagram illustrative of an embodiment of a routine 6600 implemented by the streaming data processor 308 to perform sequential outlier detection. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 6600 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the sequential outlier detector 6004. Thus, the following illustrative embodiment should not be construed as limiting.

At block 6602, a sequence of one or more events is extracted from raw machine data. The sequence of event(s) can be extracted from a single raw machine data element or multiple raw machine data elements ingested over a period of time.

At block 6604, the sequence is compared to one or more patterns (e.g., data patterns). For example, the sequential outlier detector 6004 can identify the length of a string vector representing the sequence (e.g., identify the number of logs or events that comprise the string vector representing the sequence) and identify zero or more data patterns that have the same length as the string vector. The sequential outlier detector 6004 can then compare the string vector to just those data patterns having the same length. The comparison can include identifying whether the first log or event of the string vector matches the first log or event of a data pattern, whether the second log or event of the string vector matches the second log or event of a data pattern, and so on.

At block 6606, the sequence is assigned to a new pattern based on a distance between the sequence and each of the one or more patterns being greater than a minimum cluster distance. For example, the new pattern may include the logs or events of the sequence.

At block 6608, the sequence is determined to be anomalous in response to the assignment of the sequence to the new pattern. For example, the sequence may be identified as being anomalous because the sequence is abnormal when compared to other sequences that have previously been identified.

Fewer, more, or different blocks can be used as part of the routine 6600. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 66 can be implemented in a variety of orders, or can be performed concurrently. For example, the sequence can be determined to be anomalous before being assigned to the new pattern.

FIG. 67 is another flow diagram illustrative of an embodiment of a routine 6700 implemented by the streaming data processor 308 to perform sequential outlier detection. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 6700 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the sequential outlier detector 6004. Thus, the following illustrative embodiment should not be construed as limiting.

At block 6702, a sequence of one or more events is extracted from raw machine data. The sequence of event(s) can be extracted from a single raw machine data element or multiple raw machine data elements ingested over a period of time.

At block 6704, the sequence is compared to one or more patterns (e.g., data patterns). For example, the sequential outlier detector 6004 can identify the length of a string vector representing the sequence (e.g., identify the number of logs or events that comprise the string vector representing the sequence) and identify zero or more data patterns that have the same length as the string vector. The sequential outlier detector 6004 can then compare the string vector to just those data patterns having the same length. The comparison can include identifying whether the first log or event of the string vector matches the first log or event of a data pattern, whether the second log or event of the string vector matches the second log or event of a data pattern, and so on.

At block 6706, a determination is made that the sequence corresponds to a first pattern. For example, the sequential outlier detector 6004 can determine that the string vector corresponds to the first pattern because the string vector has the highest match rate with the first pattern (e.g., more of the string vector logs or events match the first pattern logs or events than the logs or events of other data patterns).

At block 6708, a determination is made that the sequence does not completely match the first pattern. For example, the sequential outlier detector 6004 may determine that while the string vector corresponds to the first pattern, the first pattern does not completely describe the string vector. The first pattern may not completely describe the string vector because, for example, one log or event of the string vector (e.g., “LOG1”) is not equal to a corresponding log or event of the first pattern (e.g., “LOG2”).

At block 6710, the first pattern is updated to include a wildcard. For example, the sequential outlier detector 6004 can update the first pattern to include a wildcard instead of a log or event for the log or event that does not match the corresponding log or event of the string vector. In this way, the first pattern can be updated to include a wildcard so that the first pattern now completely describes the string vector.
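As a non-limiting illustration, blocks 6704 through 6710 could be sketched as follows. The names `WILDCARD`, `match_rate`, `best_pattern`, and `generalize` are hypothetical and do not appear in the disclosure, and the sketch assumes that a sequence and a pattern are both lists of log or event strings.

```python
WILDCARD = "*"  # hypothetical marker standing in for "any log or event"

def match_rate(sequence, pattern):
    """Fraction of positions at which the pattern describes the sequence."""
    hits = sum(1 for s, p in zip(sequence, pattern) if p == WILDCARD or p == s)
    return hits / len(sequence)

def best_pattern(sequence, patterns):
    """Index of the same-length pattern with the highest match rate, or None."""
    candidates = [
        (match_rate(sequence, p), i)
        for i, p in enumerate(patterns)
        if len(p) == len(sequence)  # compare only against same-length patterns
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]

def generalize(sequence, pattern):
    """Replace each mismatched position of the pattern with a wildcard."""
    return [p if (p == WILDCARD or p == s) else WILDCARD
            for s, p in zip(sequence, pattern)]

patterns = [["LOG2", "LOG3", "LOG4"], ["LOG9", "LOG9"], ["LOG7", "LOG8", "LOG4"]]
sequence = ["LOG1", "LOG3", "LOG4"]
idx = best_pattern(sequence, patterns)
patterns[idx] = generalize(sequence, patterns[idx])
```

In this sketch the updated pattern completely describes the sequence, because every position either matches exactly or is a wildcard.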

At block 6712, a first event of the first pattern is analyzed to determine percentiles of values. In other words, the first event of the first pattern can be analyzed to determine a distribution of values corresponding to the first event. For example, the first event of the first pattern may be a wildcard. The sequential outlier detector 6004 can identify all of the events that are represented by the wildcard, and determine the percentiles of the occurrence of these events or other statistics.

At block 6714, the sequence is detected as being anomalous based on values that fall below or above a threshold percentile. For example, the sequential outlier detector 6004 can determine that a sequence that has a log or event corresponding to the first log or event of the first pattern with an occurrence falling below a certain percentile or falling above a certain percentile may be anomalous. As a result, the sequence can be flagged as being anomalous for having at least one log or event that appears to be anomalous. A user can subsequently confirm whether the sequence is actually anomalous to improve future anomaly detections.
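One possible way to express blocks 6712 and 6714 is shown below. This is a minimal sketch that assumes the occurrences of the events represented by a wildcard are tracked as simple counts. The function names and the threshold values are illustrative assumptions rather than part of the disclosure.

```python
from collections import Counter

def percentile_rank(counts, event):
    """Percentage of observed events whose count is <= the count of `event`."""
    c = counts.get(event, 0)  # an event never seen at this position counts as 0
    values = list(counts.values())
    return 100.0 * sum(1 for v in values if v <= c) / len(values)

def is_anomalous(counts, event, low=30.0, high=100.0):
    """Flag an event whose occurrence falls below `low` or above `high`."""
    rank = percentile_rank(counts, event)
    return rank < low or rank > high

# Events previously observed at the wildcard position (illustrative data).
observed = ["LOGA"] * 50 + ["LOGB"] * 45 + ["LOGC"] * 40 + ["LOGD"] * 1
counts = Counter(observed)
```

With these hypothetical counts, a sequence containing the rarely seen "LOGD" at the wildcard position would be flagged, while a sequence containing "LOGB" would not.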

Fewer, more, or different blocks can be used as part of the routine 6700. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 67 can be implemented in a variety of orders, or can be performed concurrently.

4.16.3. Sentiment Analysis

Increasingly, users are transmitting messages, submitting item reviews, submitting social media postings, and/or providing other types of text. In some cases, a message, item review, social media posting, or other type of text is associated with a rating or label from which a sentiment of the message, item review, social media posting, and/or other type of text can be inferred. For example, a user may submit a review of an item and assign the item five out of five stars. As another example, a user may submit a social media posting that prompts the user or other users to hit a “thumbs up” button. In other cases, however, a message, item review, social media posting, and/or other type of text may not be associated with any rating or label. Thus, it may be difficult to determine the sentiment of such messages, item reviews, social media postings, and/or other types of text.

Accordingly, the sentiment analyzer 6006 can implement an online machine learning algorithm to learn from messages, item reviews, social media postings, or other types of text that are associated with ratings or labels from which sentiment could be inferred, and to assign ratings or labels and infer sentiment from messages, item reviews, social media postings, or other types of text that lack ratings or labels from which sentiment could otherwise be inferred. The sentiment analyzer 6006 can be a component in a data processing pipeline that performs sentiment analysis, as shown in FIG. 68. As illustrated in FIG. 68, raw machine data may originate from a data stream source 6802, which may be internal or external to the data intake and query system 108. The raw machine data may be transformed by zero or more data processing components 6804 before being provided to the sentiment analyzer 6006 as an input. The sentiment analyzer 6006 can transform the provided raw machine data (e.g., by predicting a sentiment of the text included in the raw machine data) and produce a corresponding output. Zero or more data processing components 6806 can transform the output produced by the sentiment analyzer 6006 before the optionally transformed output is written to an index 6808, such as the indexing system 212, and/or to any data store present in the data intake and query system 108.

FIG. 69 illustrates an example block diagram of the sentiment analyzer 6006 depicting operations that are performed when raw machine data includes both text 6901 and a rating or label 6910. As illustrated in FIG. 69, the sentiment analyzer 6006 can include a tokenizer 6902, a vector generator 6904, an online stochastic gradient descent (SGD) model 6906, and an output comparator 6908.

The tokenizer 6902 can take the text 6901 comprised within ingested raw machine data and extract one or more tokens 6903 or fields from the text 6901. In some embodiments, the text 6901 may include multiple tokens 6903. The tokenizer 6902 may extract some, but not all, of the tokens 6903 from the text 6901 or may extract all of the tokens 6903 from the text 6901. The tokenizer 6902 can pass the extracted token(s) 6903 to the vector generator 6904.

The vector generator 6904 can generate a vector 6905 using the token(s) 6903. For example, the vector generator 6904 can use an algorithm, such as hashing TF or CountVectorizer, to generate the vector 6905 using the token(s) 6903.
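For illustration only, a hashing term-frequency vector could be produced along the following lines. The function name `hashing_tf`, the vector size of 16, and the use of CRC-32 as the hash function are assumptions made for this sketch, and the disclosure does not specify a particular hash function or dimensionality.

```python
import zlib

def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector by hashing.

    CRC-32 is used here only because it is deterministic across runs;
    any stable hash function could be substituted.
    """
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1  # tokens that collide share the same element
    return vector

vector = hashing_tf(["great", "product", "great"])
```

A fixed-length vector of this kind lets the downstream model accept text with an unbounded vocabulary, at the cost of occasional hash collisions between unrelated tokens.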

The online SGD model 6906 may output a determined sentiment 6907 of the text 6901 in response to receiving the vector 6905 as an input. The online SGD model 6906 may be trained and re-trained using an online SGD algorithm, periodically or continuously optimized by the online SGD algorithm to minimize a difference between the determined sentiment 6907 and the actual sentiment of the text 6901. Alternatively, the online SGD model 6906 can output a predicted rating or label of the text 6901 in response to receiving the vector 6905 as an input, and the online SGD model 6906 may be periodically or continuously optimized by the online SGD algorithm to minimize a difference between the predicted rating or label and the assigned rating or label of the text 6901.

The output comparator 6908 can implement the online SGD algorithm. For example, the output comparator 6908 can receive the rating or label 6910 as an input and infer a sentiment (e.g., a positive sentiment, a negative sentiment, a neutral sentiment, etc.) from the rating or label 6910. In some embodiments, a high rating or label (e.g., 4 or 5 stars out of 5 stars, a thumbs up selection, etc.) may indicate a positive sentiment, a low rating or label (e.g., 1 or 2 stars out of 5 stars, a thumbs down selection, etc.) may indicate a negative sentiment, and a medium rating or label (e.g., 3 stars out of 5 stars, no thumbs up or down selection, etc.) may indicate a neutral sentiment. The output comparator 6908 can then compare the determined sentiment 6907 with the inferred sentiment (or infer a sentiment from the predicted rating or label, and compare the sentiment inferred from the predicted rating or label with the inferred sentiment). If the difference between the determined sentiment 6907 and the inferred sentiment (e.g., loss 6911) is greater than a loss determined using previously ingested raw machine data, then the output comparator 6908 can generate updated model parameters based on a step size selected for the online SGD algorithm and the value of the loss 6911 (or the difference between the loss 6911 and a previous loss), in accordance with the online SGD algorithm. For example, the updated model parameters may be generated in an attempt to reduce future losses. If the loss 6911 is less than a loss determined using previously ingested raw machine data, then the output comparator 6908 optionally generates updated model parameters based on a step size selected for the online SGD algorithm and the value of the loss 6911 (or the difference between the loss 6911 and a previous loss) to further reduce future losses, in accordance with the online SGD algorithm. The output comparator 6908 can then update the online SGD model 6906 using the updated model parameters.
The output comparator 6908 may further output the determined sentiment 6907 and/or the loss 6911 to the next component in the data processing pipeline. In this way, the sentiment analyzer 6006 can learn from ingested raw machine data that includes text and a rating or label to improve sentiment detection in raw machine data ingested in the future, such as raw machine data that lacks a rating or label.
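The training loop described above could be approximated, purely as a sketch, by an online logistic regression that takes one gradient step per labeled example. The class `OnlineLogisticSGD`, the function `sentiment_from_rating`, the choice of a logistic loss, and the step size are all assumptions for illustration. For simplicity the sketch updates on every labeled example rather than conditioning the update on a comparison with a previous loss, and it models only positive versus negative sentiment.

```python
import math

def sentiment_from_rating(stars):
    """Infer a sentiment from a 1-to-5 star rating, per the ranges above."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

class OnlineLogisticSGD:
    """Binary (positive vs. negative) model updated one example at a time."""

    def __init__(self, num_features, step_size=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.step_size = step_size

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one SGD step on example (x, y) and return the loss before it."""
        p = min(max(self.predict_proba(x), 1e-12), 1.0 - 1e-12)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        gradient = p - y  # derivative of the logistic loss w.r.t. z
        self.weights = [w - self.step_size * gradient * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.step_size * gradient
        return loss

model = OnlineLogisticSGD(num_features=2)
x_pos, x_neg = [1, 0], [0, 1]  # toy vectors standing in for hashed tokens
first_loss = model.update(x_pos, 1)
for _ in range(50):
    model.update(x_neg, 0)
    last_loss = model.update(x_pos, 1)
```

In this toy run the loss on the positive example decreases as more labeled examples are ingested, which mirrors the goal of reducing future losses.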

FIG. 70 illustrates an example block diagram of the sentiment analyzer 6006 depicting operations that are performed when raw machine data includes the text 6901, but no rating or label 6910. As illustrated in FIG. 70, the tokenizer 6902 generates token(s) 6903 based on the text 6901, and the vector generator 6904 generates the vector 6905.

The online SGD model 6906 trained and/or re-trained by the output comparator 6908 can take the vector 6905 as an input and generate a determined sentiment 7007 of the text 6901 and/or a rating or label 7008. For example, the online SGD model 6906 can use the vector 6905 to assign a rating or label 7008 to the text 6901. In particular, the online SGD model 6906 may be trained to recognize certain vector elements (e.g., hashed tokens) as having a positive sentiment, negative sentiment, neutral sentiment, etc. using ingested raw machine data that includes ratings or labels. Thus, the online SGD model 6906 can output the rating or label 7008 based on the training when no rating or label 6910 is included in ingested raw machine data. As described above, the sentiment analyzer 6006 can infer a sentiment of the text 6901 based on the assigned rating or label. Thus, the online SGD model 6906 (or the output comparator 6908) can infer the determined sentiment 7007 based on the generated rating or label 7008. The online SGD model 6906 may further output the determined sentiment 7007 and/or the rating or label 7008 to the next component in the data processing pipeline. In this way, the sentiment analyzer 6006 can detect the sentiment of ingested raw machine data (e.g., ingested text) when the ingested raw machine data is not associated with or does not include a rating or label from which the sentiment could otherwise be inferred.

In some embodiments, the online SGD algorithm implemented by the sentiment analyzer 6006 can be an adaptive online SGD algorithm (e.g., online SGD with AdaGrad). In other embodiments, the online SGD algorithm implemented by the sentiment analyzer 6006 can be a norm version of an adaptive online SGD algorithm (e.g., online SGD with AdaGrad and/or Adaptive Norm).
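As a hedged illustration of the adaptive variant, the standard AdaGrad rule scales the step for each parameter by the square root of that parameter's accumulated squared gradients. The function name `adagrad_step` and the default values below are assumptions for this sketch. The norm version mentioned above is not shown.

```python
import math

def adagrad_step(weights, grads, accum, step_size=0.5, eps=1e-8):
    """Apply one AdaGrad update and return (new_weights, new_accum).

    accum holds the running sum of squared gradients per parameter, so
    parameters with large past gradients take smaller steps.
    """
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_weights = [w - step_size * g / (math.sqrt(a) + eps)
                   for w, g, a in zip(weights, grads, new_accum)]
    return new_weights, new_accum

w1, a1 = adagrad_step([0.0, 0.0], [2.0, 0.0], [0.0, 0.0])
w2, a2 = adagrad_step(w1, [2.0, 0.0], a1)
```

Repeating the same gradient yields a smaller second step, which illustrates how the effective step size adapts per parameter as data is ingested.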

FIG. 71 is a flow diagram illustrative of an embodiment of a routine 7100 implemented by the streaming data processor 308 to perform sentiment analysis. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 7100 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the sentiment analyzer 6006. Thus, the following illustrative embodiment should not be construed as limiting.

At block 7102, one or more tokens are generated using text. For example, the text may be comprised within ingested raw machine data. The tokens may each represent different alphanumeric strings (e.g., words, phrases, etc.) comprised within the text.

At block 7104, a vector is generated using the one or more tokens. For example, hashing TF can be used by the sentiment analyzer 6006 to hash each of the tokens and to organize the hashed tokens as elements in a vector.

At block 7106, the vector is applied as an input to an online SGD model to produce a prediction. For example, the prediction may be a predicted sentiment of the text and/or a rating or label to assign to the text (e.g., if no rating or label accompanies the text in the ingested raw machine data). In some embodiments, the online SGD model can predict a rating or label, and the sentiment analyzer 6006 can then infer a sentiment from the predicted rating or label to produce the prediction.

At block 7108, the prediction is compared to a rating. For example, the prediction may be a rating or label and may be compared to a rating or label if a rating or label accompanies the text. Alternatively, the sentiment analyzer 6006 can predict a rating or label at block 7106, can infer a sentiment from the predicted rating or label, and can compare the sentiment inferred from the predicted rating or label with a sentiment inferred from a rating or label included in ingested raw machine data.

At block 7110, the online SGD model is updated based on the comparison. For example, the comparison can yield a loss representing a difference between the predicted rating or label and the rating or label comprised within the ingested raw machine data (or a difference between a sentiment inferred from the predicted rating or label and a sentiment inferred from the rating or label comprised within the ingested raw machine data). The sentiment analyzer 6006 can generate one set of updated model parameters to update the online SGD model if the comparison yields a loss that is greater than a previously generated loss, and can generate another set of updated model parameters to update the online SGD model if the comparison yields a loss that is less than a previously generated loss.

At block 7112, the prediction is outputted. For example, the prediction may be output to another component in a data processing pipeline. As described herein, the prediction can include a rating or label, a determined sentiment, a loss, and/or the like.

Fewer, more, or different blocks can be used as part of the routine 7100. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 71 can be implemented in a variety of orders, or can be performed concurrently. For example, the prediction can be outputted before the online SGD model is updated. As another example, the online SGD model may not be updated if the online SGD algorithm has already determined model parameters for the online SGD model that minimize the loss.

4.16.4. Drift Detection

Time-series data often follows a trend or pattern. In some cases, the trend or pattern can shift. In other words, the time-series data may have a certain distribution over one period of time, but may shift to have another distribution over a subsequent period of time. As an illustrative example, FIG. 72 illustrates a graph 7200 showing time-series data values. The time-series has one distribution until a shift occurs at time-series data value 7202. The time-series has the shifted distribution until another shift occurs at time-series data value 7204. Further shifts occur at time-series data values 7206, 7208, and 7210.

Detecting a time at which the shift occurs can be difficult in real-time as time-series data is ingested, however. For example, even if the most-recently ingested time-series data value appears different from the previously ingested time-series data values, the most-recently ingested time-series data value could simply be an outlier and not the start of a shift in the trend or pattern of the time-series data.

In an offline or batch setting, the Kolmogorov-Smirnov test (K-S test), the mean and variance test (e.g., mean and variance can be calculated on a set of time-series data values over one time period and a second time period, where a variance shift is detected if the means are the same but the variances are different, and where a mean shift is detected if the variances are the same but the means are different), or Exchangeability Martingales can be used to identify a shift in the trend or pattern of the time-series data. These tests perform poorly if applied in an online setting, however. For example, the mean and variance test is susceptible to outlier time-series data values, and therefore provides poor results. The K-S test, if applied in an online setting, may result in a system predetermining a time window and, for every time-series data value, redoing the K-S test computation. Application of the Exchangeability Martingales in an online setting would result in a similar situation. Thus, using the K-S test or Exchangeability Martingales in an online setting may be very computationally intensive and result in slow performance if computing resources are limited.

To address these technical deficiencies, a modified version of an online Bayesian changepoint detection algorithm can be used to detect shifts in the trend or pattern of ingested time-series data in real-time as the time-series data (e.g., raw machine data) is ingested. For example, the online Bayesian changepoint detection algorithm is described in Adams et al., “Bayesian Online Changepoint Detection,” Oct. 19, 2007 (“Adams”), which is hereby incorporated by reference herein in its entirety. The online Bayesian changepoint detection algorithm disclosed in Adams may read one time-series data value at a time and provide an estimate of the likelihood that a read time-series data value is a changepoint or transition point at which the distribution of a time-series shifts. The online Bayesian changepoint detection algorithm disclosed in Adams may generate the estimate based on time-series data values read up to the point in time of the current time-series data value being read.

While the online Bayesian changepoint detection algorithm disclosed in Adams produces accurate results in an online setting, the algorithm uses all previous time-series data values to generate the estimate. With a small, finite dataset, the algorithm may be appropriate. However, the algorithm may begin to slow down as the number of time-series data values that are read increases given that all previous time-series data values are analyzed each time an estimate is generated. Thus, the algorithm may be too resource intensive for detecting shifts in the distribution of a time-series in an online setting.

The modified version of the online Bayesian changepoint detection algorithm, however, can detect shifts in the distribution of a time-series in an online setting without consuming as many computing resources. For example, the drift detector 6008 can implement the modified version of the online Bayesian changepoint detection algorithm. The drift detector 6008 can be a component in a data processing pipeline that performs time-series drift detection, as shown in FIG. 73. As illustrated in FIG. 73, raw machine data may originate from a data stream source 7302, which may be internal or external to the data intake and query system 108. The raw machine data may be transformed by zero or more data processing components 7304 before being provided to the drift detector 6008 as an input. The drift detector 6008 can transform the provided raw machine data (e.g., by determining a likelihood that the raw machine data represents a changepoint or transition point at which the distribution of the time-series has shifted) and produce a corresponding output. Zero or more data processing components 7306 can transform the output produced by the drift detector 6008 before the optionally transformed output is written to an index 7308, such as the indexing system 212, and/or to any data store present in the data intake and query system 108.

Rather than storing information derived from all of the previously ingested time-series data values, the drift detector 6008 may store a subset of information derived from the previously ingested time-series data values. In particular, the drift detector 6008 can store information derived from the last N (e.g., 20, 30, 50, 100, etc.) ingested time-series data values rather than information derived from all of the previously ingested time-series data values.

The information derived from an ingested time-series data value may be a probability distribution. For example, the drift detector 6008 can determine a probability distribution for an ingested raw machine data element (e.g., a time-series data value) using the online Bayesian changepoint detection algorithm. The probability distribution may be associated with a time (e.g., a timestamp associated with the ingested raw machine data element). Before, during, and/or after the drift detector 6008 determines the probability distribution for the ingested raw machine data element, the drift detector 6008 can analyze previously generated probability distributions (e.g., generated for previously ingested raw machine data elements) and discard any of the previously generated probability distributions associated with a time outside a time window. For example, ingested raw machine data elements may be generated in periodic intervals, and therefore the time window may correspond to N raw machine data elements. In some embodiments, the time window may start at some time t before a current time and end at the current time.
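The discarding step could be sketched as follows, assuming that each stored entry pairs a timestamp with the information derived for that element and that entries are kept in increasing time order. The function name `discard_outside_window` and the representation of an entry as a tuple are hypothetical.

```python
from collections import deque

def discard_outside_window(entries, now, window_seconds):
    """Remove and return entries whose timestamp precedes the time window.

    entries is a deque of (timestamp, info) tuples in increasing time order,
    so expired entries are always found at the left end.
    """
    cutoff = now - window_seconds
    discarded = []
    while entries and entries[0][0] < cutoff:
        discarded.append(entries.popleft())
    return discarded

entries = deque([(0, "dist_a"), (5, "dist_b"), (10, "dist_c")])
dropped = discard_outside_window(entries, now=12, window_seconds=5)
```

Returning the discarded entries, rather than silently dropping them, leaves room for the optional adjustment of the remaining distributions based on what was discarded.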

For each of the remaining previously generated probability distributions, the drift detector 6008 can optionally adjust the respective probability distribution based on the probability distribution of the most-recently ingested raw machine data element. For example, the remaining previously generated probability distributions can be adjusted to take into account the occurrence of the most-recently ingested raw machine data element. The adjustment can be performed by the drift detector 6008 in accordance with the online Bayesian changepoint detection algorithm. For each of the remaining probability distributions (including the probability distribution of the most-recently ingested raw machine data element), the drift detector 6008 can optionally adjust the respective probability distribution (e.g., adjust a mean of the respective probability distribution) based on some or all of the discarded probability distributions. For example, the remaining probability distributions can be adjusted such that the mean of the remaining probability distributions is equivalent to the mean of the probability distributions if none of the discarded probability distributions had been discarded.

Once the remaining probability distributions are optionally adjusted, the drift detector 6008 can use the online Bayesian changepoint detection algorithm and the optionally adjusted probability distributions to determine a likelihood that the most-recently ingested raw machine data element marks a changepoint or transition point at which the distribution of the time-series has shifted. The drift detector 6008 can provide the likelihood as an input to another component in the data processing pipeline.

The drift detector 6008 can store the adjusted and/or unadjusted probability distributions. Alternatively, the adjusted and/or unadjusted probability distributions can be stored external to the drift detector 6008, and retrieved by the drift detector 6008 when needed.

FIG. 74 is a flow diagram illustrative of an embodiment of a routine 7400 implemented by the streaming data processor 308 to perform drift detection in time-series data. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 7400 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the drift detector 6008. Thus, the following illustrative embodiment should not be construed as limiting.

At block 7402, variable i is set equal to 1. The variable i may indicate the most-recently ingested raw machine data element.

At block 7404, a probability distribution for raw machine data i is determined. For example, the probability distribution may be determined by the drift detector 6008 using the online Bayesian changepoint detection algorithm.

At block 7406, a probability distribution for any previous raw machine data associated with a time outside a time window may be discarded. For example, determined probability distributions and/or the raw machine data from which the probability distributions are generated may be associated with a time, such as a time at which the raw machine data occurred or was generated. The time window may be defined as the last N seconds, minutes, hours, days, weeks, etc. Discarding probability distributions associated with raw machine data older than the defined time window may minimize the number of operations performed to determine the likelihood that the most-recently ingested raw machine data element is a changepoint or transition point, and may reduce the amount of computing resources (e.g., memory capacity) required to store and/or process determined probability distributions. Thus, the modified version of the online Bayesian changepoint detection algorithm implemented by the drift detector 6008 may use fewer computing resources and perform faster than the online Bayesian changepoint detection algorithm disclosed in Adams.

At block 7408, variable k is set to equal the number of probability distributions. For example, variable k may be equal to the number of probability distributions that remain after the discarding operation is performed.

At block 7410, a determination is made as to whether variable k is greater than 1. If variable k is greater than 1, then additional probability distributions remain that may need to be adjusted or updated and the routine 7400 proceeds to block 7412. Otherwise, if variable k is not greater than 1, then all remaining probability distributions may have been adjusted or updated, if necessary, and the routine 7400 proceeds to block 7416.

At block 7412, probability distribution k is updated using at least one of the probability distribution of raw machine data i or the discarded probability distribution(s). For example, probability distribution k, which may correspond to a previously ingested raw machine data element, may be updated to take into account the occurrence of raw machine data i. Probability distribution k may also be updated to take into account the probability distribution(s) that have been discarded or deleted. For example, the mean of probability distribution k may be updated such that the total mean of the remaining probability distributions would be equivalent to the mean if none of the discarded probability distribution(s) were actually discarded (e.g., at least a portion of the discarded probability distribution(s) may be shifted to probability distribution k).

At block 7414, variable k is decremented by 1. Variable k may be decremented by 1 so that the next probability distribution can be optionally updated. After variable k is decremented, the routine 7400 reverts back to block 7410.

At block 7416, whether raw machine data i corresponds to a changepoint is determined based on the probability distributions. For example, the drift detector 6008 can apply some or all of the optionally updated remaining probability distributions to the online Bayesian changepoint detection algorithm to determine whether raw machine data i is likely to be a changepoint or transition point at which the distribution of the time-series has shifted.

At block 7418, variable i is incremented by 1. Variable i may be incremented to represent that the next ingested raw machine data element will be evaluated to determine whether the next ingested raw machine data element is a changepoint or transition point. After variable i is incremented, the routine 7400 reverts back to block 7404.

Fewer, more, or different blocks can be used as part of the routine 7400. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 74 can be implemented in a variety of orders, or can be performed concurrently. For example, probability distributions can be discarded before the probability distribution for raw machine data i is determined.
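One way a windowed variant of online Bayesian changepoint detection might look is sketched below. This is not the disclosed implementation. It assumes Gaussian observations with a known variance, a constant hazard rate, a window expressed as a maximum number of retained run-length hypotheses (`max_run`), and simple renormalization as the way discarded probability mass is shifted to the remaining hypotheses. The class name, parameter names, and default values are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

class WindowedChangepointDetector:
    """Run-length posterior truncated to the `max_run` most recent hypotheses."""

    def __init__(self, max_run=50, hazard=0.02, prior_mean=0.0,
                 prior_var=10.0, obs_var=1.0):
        self.max_run, self.hazard = max_run, hazard
        self.prior_mean, self.prior_var, self.obs_var = prior_mean, prior_var, obs_var
        self.probs = [1.0]            # probs[r]: probability the current run has length r
        self.means = [prior_mean]     # belief about the segment mean for each run length
        self.variances = [prior_var]

    def update(self, x):
        # Predictive density of x under each retained run-length hypothesis.
        pred = [gaussian_pdf(x, m, v + self.obs_var)
                for m, v in zip(self.means, self.variances)]
        weighted = [p * q for p, q in zip(self.probs, pred)]
        # Either a changepoint resets the run to length 0, or each run grows by 1.
        probs = [sum(weighted) * self.hazard] + [w * (1.0 - self.hazard) for w in weighted]
        # Conjugate update of each hypothesis to account for the new value.
        means, variances = [self.prior_mean], [self.prior_var]
        for m, v in zip(self.means, self.variances):
            post_var = 1.0 / (1.0 / v + 1.0 / self.obs_var)
            means.append(post_var * (m / v + x / self.obs_var))
            variances.append(post_var)
        # Discard hypotheses outside the window, then renormalize so the
        # discarded mass is spread over the hypotheses that remain.
        probs = probs[: self.max_run]
        total = sum(probs)
        self.probs = [p / total for p in probs]
        self.means = means[: self.max_run]
        self.variances = variances[: self.max_run]
        return self.probs

    def most_likely_run_length(self):
        return max(range(len(self.probs)), key=self.probs.__getitem__)

detector = WindowedChangepointDetector(max_run=50)
for i in range(30):
    detector.update(0.1 if i % 2 == 0 else -0.1)
before = detector.most_likely_run_length()
for i in range(10):
    detector.update(10.1 if i % 2 == 0 else 9.9)
after = detector.most_likely_run_length()
```

In this sketch a sharp drop in the most likely run length suggests that a changepoint occurred recently. Because at most `max_run` hypotheses are retained, the work per ingested value is bounded regardless of how many values have been read.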

4.16.5. Explainability

As described herein, anomalies can be detected in pipeline metrics, logs or events, or other fields present in ingested raw machine data. While detecting and surfacing an anomaly to a user can be useful, the user may not understand why the anomaly occurred in the first place. If there are issues with the data processing pipeline or ingested raw machine data, any delay in identifying the cause of an anomaly can cause downstream data processing issues and/or delays.

The anomaly explainer 6010 can reduce downstream data processing issues and/or delays by identifying likely causes of detected anomalies. The anomaly explainer 6010 can implement none, some, or all of the functionality of the anomaly metric identifier 3410 described above in identifying the likely causes. For example, the anomaly explainer 6010 can provide explanations for anomalies detected in pipeline metrics, logs or events, or other fields present in ingested raw machine data based on patterns observed in logs or events or other fields present in ingested raw machine data. Specifically, the anomaly explainer 6010 can correlate pipeline metrics, logs or events, or other fields present in ingested raw machine data identified as being anomalous with other fields present in ingested raw machine data that have not been identified as being anomalous, and use the other fields not identified as being anomalous as a root cause analysis for explaining why a metric, log, event, or other field is observed as an outlier.

The anomaly explainer 6010 can be a component in a data processing pipeline that provides explanations for the occurrence of anomalies, as shown in FIG. 75. As illustrated in FIG. 75, raw machine data may originate from a data stream source 7502, which may be internal or external to the data intake and query system 108. The raw machine data may be transformed by zero or more data processing components 7504 before being provided to the anomaly detector 3406 as an input. The anomaly detector 3406 can transform the provided raw machine data (e.g., by identifying an anomaly) and produce a corresponding output. The anomaly explainer 6010 can transform the output (e.g., by identifying one or more fields that may be correlated with another field being anomalous) and produce a corresponding second output. Zero or more data processing components 7506 can also transform the output produced by the anomaly detector 3406 before the optionally transformed output is written to an index 7508, such as the indexing system 212, and/or to any data store present in the data intake and query system 108. Similarly, the second output can be written to the index 7508 or a different index, not shown. The anomaly explainer 6010 can produce the second output asynchronously with the zero or more data processing components 7506 transforming the output, and can produce the second output before, during, and/or after the zero or more data processing components 7506 transform the output.

While the present disclosure describes the anomaly explainer 6010 as determining an explanation for why an anomaly occurred, this is not meant to be limiting. For example, the data processing pipeline may include the pipeline metric outlier detector 3408 instead of the anomaly detector 3406, and therefore the anomaly explainer 6010 can produce an output explaining an anomaly detected in a pipeline metric instead of in a log or event. Similarly, the data processing pipeline may include the adaptive thresholder 6002, the sequential outlier detector 6004, the sentiment analyzer 6006, and/or the drift detector 6008 instead of the anomaly detector 3406. If the adaptive thresholder 6002 is present, the anomaly explainer 6010 can produce an output explaining an anomaly or outlier detected in the time window. If the sequential outlier detector 6004 is present, the anomaly explainer 6010 can produce an output explaining an anomaly in a sequence of logs or events. If the sentiment analyzer 6006 is present, the anomaly explainer 6010 can produce an output explaining why a particular sentiment is detected (e.g., the token(s) that led to the detection of a particular sentiment). If the drift detector 6008 is present, the anomaly explainer 6010 can produce an output explaining why an ingested raw machine data element is determined or not determined to be a changepoint or transition point.

The anomaly explainer 6010 can receive from the anomaly detector 3406 (or pipeline metric outlier detector 3408, adaptive thresholder 6002, sequential outlier detector 6004, sentiment analyzer 6006, drift detector 6008, etc.) information identifying an anomalous token (e.g., log, event, or other field in ingested raw machine data), including a timestamp corresponding to the anomalous token. The anomaly explainer 6010 can obtain the ingested raw machine data in which the anomalous token is detected and extract one or more tokens from the ingested raw machine data. In some embodiments, the anomaly explainer 6010 extracts some, but not all, of the non-anomalous tokens to reduce computing resource usage. In other embodiments, the anomaly explainer 6010 extracts all of the non-anomalous tokens. The anomaly explainer 6010 can analyze the extracted token(s) and store value(s) of the extracted token(s). The anomaly explainer 6010 may repeat this operation one or more times when the same type of token (e.g., the same field, the same log, the same event, etc.) is determined to be anomalous in subsequent ingested raw machine data. Thus, the anomaly explainer 6010 may store information indicating the values of non-anomalous tokens when a certain type of token is determined to be anomalous. The anomaly explainer 6010 can perform a statistical analysis on the non-anomalous token values to determine if there are any correlations between one type of token being anomalous and another type of token having a certain value or a certain range of values. If a correlation exists, this might indicate that the correlated non-anomalous token having a certain value or a certain range of values causes the anomalous token to have an anomalous value. If no correlation exists, the anomaly explainer 6010 may extract additional tokens from the ingested raw machine data and/or from the common storage 216, and analyze these tokens to determine whether any correlations exist. Thus, the anomaly explainer 6010 can extract some, but not all, tokens as the raw machine data is ingested to determine whether correlations exist with an anomalous token in an attempt to reduce computing resource usage. If no correlations are detected after one or more raw machine data elements are ingested, then the anomaly explainer 6010 can extract additional tokens from ingested raw machine data and/or the common storage 216 to determine whether correlations exist between the additionally extracted tokens and the anomalous token. The anomaly explainer 6010 can repeat this process zero or more times until a correlation is identified and/or until all tokens have been extracted.
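
As an illustration only, the bookkeeping described above (storing the values of non-anomalous tokens each time a given type of token is anomalous, then checking whether any of them consistently takes the same value) might be sketched as follows. The class name, the thresholds, and the "most common value" consistency test are assumptions made for this sketch; the disclosure does not specify a particular statistical analysis, and a fuller analysis would also compare against how often each value occurs when no anomaly is present.

```python
from collections import Counter, defaultdict


class CorrelationTracker:
    """Hypothetical helper: remembers non-anomalous token values seen
    whenever a given type of token is flagged as anomalous."""

    def __init__(self, min_observations=5, min_consistency=0.8):
        self.min_observations = min_observations  # anomalies needed before judging
        self.min_consistency = min_consistency    # share of anomalies with the same value
        # anomalous token type -> other token type -> Counter of observed values
        self._seen = defaultdict(lambda: defaultdict(Counter))
        self._counts = Counter()

    def record(self, anomalous_type, other_tokens):
        """Store the values of the extracted non-anomalous tokens."""
        self._counts[anomalous_type] += 1
        for token_type, value in other_tokens.items():
            self._seen[anomalous_type][token_type][value] += 1

    def correlations(self, anomalous_type):
        """Return {token type: value} for tokens that consistently had one value."""
        total = self._counts[anomalous_type]
        if total < self.min_observations:
            return {}
        found = {}
        for token_type, values in self._seen[anomalous_type].items():
            value, count = values.most_common(1)[0]
            if count / total >= self.min_consistency:
                found[token_type] = value
        return found


tracker = CorrelationTracker(min_observations=3, min_consistency=0.8)
observations = [("db-7", 200), ("db-7", 500), ("db-7", 200), ("db-7", 503), ("web-1", 404)]
for host, status in observations:
    tracker.record("latency_ms", {"host": host, "status": status})
result = tracker.correlations("latency_ms")
print(result)  # {'host': 'db-7'}
```

Here the hypothetical "host" token has the value "db-7" in four of the five anomalous observations, so it is reported as correlated, while "status" is not.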

Once a correlation is identified, the anomaly explainer 6010 can use the identified correlation to surface explanations. For example, when a subsequent raw machine data element is ingested and an anomaly is detected, the anomaly explainer 6010 can extract one or more non-anomalous tokens from the ingested raw machine data. In some embodiments, the anomaly explainer 6010 extracts some, but not all, of the non-anomalous tokens. For example, the anomaly explainer 6010 may extract non-anomalous token(s) from the ingested raw machine data that the anomaly explainer 6010 had previously determined are correlated with the anomalous token. In other embodiments, the anomaly explainer 6010 extracts all of the non-anomalous tokens. The anomaly explainer 6010 can then generate information identifying the non-anomalous token(s), if any, that are correlated with the anomalous token, such as the values and types of the non-anomalous token(s), with an indication that the identified non-anomalous token(s) are correlated with the anomalous token (e.g., an indication that there is a correlation between the non-anomalous token(s) having a certain value or range of values and the anomalous token having an anomalous value).

The anomaly explainer 6010 or another component in the data intake and query system 108 can generate user interface data that, when rendered by a client device 204, causes the client device 204 to display a user interface depicting the surfaced explanation (e.g., information identifying the non-anomalous token(s), if any, that are correlated with the anomalous token, with an indication that there is a correlation between the identified non-anomalous token(s) having certain value(s) or range(s) of values and the anomalous token having an anomalous value). For example, the surfaced explanation may be displayed, in the user interface, in the same tab or window as an identification of the anomalous token. As another example, the surfaced explanation may be displayed, in the user interface, in a different tab or window than an identification of the anomalous token. In some embodiments, the user interface can further provide (in a same or different tab or window as an identification of the anomalous token in the user interface) a visual and/or audible explanation of the determined correlation and/or potential cause (e.g., a non-anomalous token having a certain value or range of values) of the detected anomaly. Alternatively or in addition, the anomaly explainer 6010 can generate an alert identifying the correlation and/or the possible cause of the detected anomaly (e.g., an explanation that certain non-anomalous token(s) having certain value(s) or range(s) of values may be the cause of the anomaly).

The anomaly explainer 6010 can use similar techniques to those described herein to, for example, generate an explanation of why text is determined to have a particular sentiment or why a time-series data value is determined to be or not be a changepoint or transition point. For example, the anomaly explainer 6010 can use the extraction and statistical operations to determine a correlation between a vector having elements with certain hash values or tokens having certain values and text being assigned a certain rating or label or having a certain sentiment. As another example, the anomaly explainer 6010 can use the extraction and statistical operations to determine a correlation between a time-series data value being a changepoint and a time at which the changepoint is detected, a periodicity in which changepoints are detected, etc., and to determine a correlation between a time-series data value not being a changepoint and a time at which the changepoint is not detected, a periodicity in which changepoints are not detected, etc.

FIG. 76 is a flow diagram illustrative of an embodiment of a routine 7600 implemented by the streaming data processor 308 to explain anomalies. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 7600 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the anomaly explainer 6010. Thus, the following illustrative embodiment should not be construed as limiting.

At block 7602, one or more tokens are extracted from raw machine data. For example, the tokens may be extracted for the purpose of detecting anomalies in logs or events.

At block 7604, the token(s) are compared to a set of data patterns. For example, a vector may be generated using the token(s), and the vector may be compared to the set of data patterns.

At block 7606, a first value of a first token is determined to be anomalous in response to the comparison. For example, the vector may correspond to and be assigned to one of the data patterns. However, the value of the first token may have been below a lower quantile or above an upper quantile of first token values when compared with the values of the first tokens in other vectors assigned to the data pattern. Thus, the value of the first token may be considered anomalous.
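
The quantile comparison at block 7606 could, for example, be sketched as below. This is a minimal illustration under stated assumptions: the function name, the 5th and 95th percentile cut-offs, and the use of Python's `statistics.quantiles` are choices made for the sketch and are not specified by the disclosure.

```python
import statistics


def is_anomalous(value, pattern_values, lower_pct=5, upper_pct=95):
    """Return True if `value` falls below the lower or above the upper
    percentile of the first-token values already assigned to the data pattern.

    `pattern_values` needs at least two entries; the percentile choices
    are illustrative assumptions.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(pattern_values, n=100, method="inclusive")
    lower = cuts[lower_pct - 1]
    upper = cuts[upper_pct - 1]
    return value < lower or value > upper


# Values of the first token in other vectors assigned to the same pattern.
history = list(range(1, 101))
flags = [is_anomalous(v, history) for v in (3, 50, 99)]
print(flags)  # [True, False, True]
```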

At block 7608, a determination is made as to whether a correlation is identified. For example, a correlation may be identified if another token in the raw machine data from which the first token originates consistently has a certain value or a certain range of values when the first token is determined to have an anomalous value. The determination may be made on a first set of tokens extracted from the raw machine data. The first set of tokens, however, may not be all of the tokens present in the raw machine data. If a correlation is identified, then the routine 7600 proceeds to block 7614. Otherwise, if no correlation is identified, then the routine 7600 proceeds to block 7610.

At block 7610, additional token(s) are extracted. For example, additional token(s) may be extracted from the raw machine data, from the common storage 216, and/or from other data stores in the intake system 210. The additionally extracted token(s) may be different than those tokens originally extracted to identify a correlation. One or more values of the extracted token(s) may be obtained so that, for example, a correlation analysis can be performed by the anomaly explainer 6010.

At block 7612, a determination is made as to whether a correlation is identified using the additionally extracted token(s). For example, a correlation may be identified if an extracted token consistently has a certain value or a certain range of values when the first token is determined to have an anomalous value. If a correlation is identified, then the routine 7600 proceeds to block 7614. Otherwise, if no correlation is identified, then the routine 7600 optionally reverts back to block 7610 so that additional token(s) can be extracted and analyzed for identifying a correlation. For example, the routine 7600 may not revert back to block 7610 if all tokens have been extracted, in which case no correlation may be identified.
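
The loop formed by blocks 7608, 7610, and 7612 can be pictured with the following sketch, in which tokens are examined in successive batches and extraction stops at the first batch that yields a correlation. The function and parameter names are hypothetical, and the `has_correlation` callable merely stands in for whatever statistical analysis the anomaly explainer 6010 performs.

```python
def find_correlation(token_batches, has_correlation):
    """Examine tokens batch by batch until a correlated token is found.

    token_batches: iterable of lists of token names. The first batch plays
        the role of the first set of tokens (block 7608); later batches are
        the additionally extracted tokens (blocks 7610 and 7612).
    has_correlation: callable(token_name) -> bool, a stand-in for the
        correlation analysis (an assumption of this sketch).
    Returns (correlated_tokens, examined_tokens).
    """
    examined = []
    for batch in token_batches:
        examined.extend(batch)
        hits = [token for token in batch if has_correlation(token)]
        if hits:
            return hits, examined  # proceed to block 7614
    return [], examined  # all tokens extracted, no correlation identified


batches = [["status", "bytes"], ["host", "region"], ["user"]]
hits, examined = find_correlation(batches, lambda token: token == "host")
print(hits)      # ['host']
print(examined)  # ['status', 'bytes', 'host', 'region']
```

In this example the "user" token is never extracted, which reflects the stated goal of reducing computing resource usage by extracting additional tokens only when needed.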

At block 7614, information indicating that there is a correlation between the first token having an anomalous value and another token having another value is generated. For example, the information may be presented in a user interface and/or in an alert transmitted to a client device 204.

Fewer, more, or different blocks can be used as part of the routine 7600. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 76 can be implemented in a variety of orders, or can be performed concurrently. For example, additional token(s) may be extracted at block 7610 even if a correlation is identified at block 7608 or 7612. Thus, multiple correlations may be identified and surfaced to a user and/or all tokens may be evaluated for potential correlations before the routine 7600 completes.

4.16.6. Preview Mode

As described herein, a user can design a data processing pipeline. In some cases, the user may want to preview how the data processing pipeline would operate if a new node or component were added to the data processing pipeline before publishing the updated data processing pipeline to perform streaming processing, as this publishing can cause data to be written to various databases. This preview mode solves challenges of existing graphical programming systems, in that these systems provide a user with a set of valid functions and allow the user to build and deploy a data flow. In fact, the preview mode can preview implementation of the new component without fully deploying the updated data processing pipeline (e.g., without disrupting an existing data processing pipeline implemented by the intake system 210).

Typically, previewing the addition of a new component into the data processing pipeline may include identifying whether the new component is compatible with other components in the data processing pipeline and/or whether addition of the new component causes any compiling errors. The preview may show the output of the new component using a preview set of raw machine data (e.g., raw machine data ingested at a previous time and/or raw machine data currently being ingested in an active data processing pipeline), but the preview is generally limited to showing the first N (e.g., 10, 20, 50, 100, etc.) outputs even if the preview set of raw machine data includes 10N, 100N, 1000N, etc. individual raw machine data elements.

This type of preview may be inadequate if, for example, the new component is a component designed to detect an anomaly, such as the anomaly detector 3406, the pipeline metric outlier detector 3408, the adaptive thresholder 6002, and/or the sequential outlier detector 6004. In some cases, an anomaly may be present in the first N outputs. However, an anomaly may not be present in the first N outputs in other cases. In fact, it may not be clear when an anomaly would actually occur in the preview set of raw machine data, so simply increasing the number of outputs displayed in the preview may not resolve the issue. Thus, the preview may not adequately inform a user as to whether the new component properly identifies anomalies and/or properly determines when an anomaly is not present when inserted into the data processing pipeline.

Accordingly, a preview mode is described herein in which outputs of a new component or any existing component can be generated and sampled, with the sampling of outputs being displayed in the preview rather than an unfiltered listing of the outputs. For example, the outputs of the component can be generated using the preview set of raw machine data. The outputs can be parsed to identify the different types of labels present therein, where a label can include an indication that an anomaly is detected, an indication that an anomaly is not detected, the transformation of raw machine data into a different form (e.g., transformation of personally identifiable information into a mask, transformation of personally identifiable information into a partial mask, etc.), a detected sentiment, an indication that a changepoint is detected, an indication that a changepoint is not detected, and/or the like. In some cases, some label types may occur more often than other label types. The occurrence or number of each type of label can be counted or tracked. The labels, however, can then be sampled such that a similar number (e.g., equal number) of each type of label is obtained, and the sampled labels can then be displayed in the preview. By sampling a similar number of the different types of labels rather than simply displaying the first N labels, the labels that occur less often may not be dwarfed or obscured by the labels that occur more often. Thus, all of the different types of labels, not just some of the different types of labels, can then be surfaced to a user.
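
One way to picture the label-balanced sampling described above is the following sketch, which groups outputs by label type, counts each type, and then draws the same number of outputs per type. The function name, the (label, record) tuple shape, and the use of uniform random sampling within each label type are assumptions of the sketch rather than details taken from the disclosure.

```python
import random
from collections import defaultdict


def sample_balanced(outputs, per_label, seed=None):
    """Group (label, record) outputs by label type and sample evenly.

    Returns (sample, counts): `sample` maps each label type to at most
    `per_label` records; `counts` maps each label type to how often it
    occurred in the full set of outputs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, record in outputs:
        by_label[label].append(record)
    counts = {label: len(records) for label, records in by_label.items()}
    sample = {
        label: rng.sample(records, min(per_label, len(records)))
        for label, records in by_label.items()
    }
    return sample, counts


# 1000 "normal" outputs followed by only 3 "anomaly" outputs.
outputs = [("normal", i) for i in range(1000)]
outputs += [("anomaly", i) for i in range(1000, 1003)]
sample, counts = sample_balanced(outputs, per_label=2, seed=0)
print(counts)                                  # {'normal': 1000, 'anomaly': 3}
print({k: len(v) for k, v in sample.items()})  # {'normal': 2, 'anomaly': 2}
```

Displaying only the first N outputs of this example would show no anomalies at all, whereas the balanced sample surfaces both label types.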

Because the preview mode is intended to preview the operations of a node or component in the data processing pipeline, the preview mode may include a timeout feature. For example, the node or component can generate outputs using the preview set of raw machine data until a finite period of time passes, a finite period of time after the initial output data was generated has passed, a certain number of outputs have been generated, and/or the like. Once the timeout period is triggered or expires (e.g., a finite period of time passes, a certain number of outputs have been generated, etc.), the stream of raw machine data may be disabled or stopped from being applied as an input to the node or component. The timeout period may be the same or different than the period of time covered by the preview set of raw machine data. In some embodiments, the node or component can generate outputs using the preview set of raw machine data until a particular type of label has not been surfaced for a finite period of time. Thus, the stream of raw machine data may be disabled or stopped from being applied as an input to the node or component after a certain amount of time even if a particular type of label is not detected.
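
A minimal sketch of such a timeout, assuming a wall-clock limit combined with an output-count limit, is shown below. The function name, the default limits, and the injectable `clock` parameter are illustrative assumptions; the variant that stops when a particular label type has not been surfaced for some period is not shown.

```python
import itertools
import time


def run_preview(stream, node, max_seconds=5.0, max_outputs=100, clock=time.monotonic):
    """Apply `node` to elements of `stream` until a timeout condition is met.

    Stops feeding the node once `max_seconds` have elapsed since the preview
    started or once `max_outputs` outputs have been generated.
    """
    start = clock()
    outputs = []
    for element in stream:
        if clock() - start >= max_seconds or len(outputs) >= max_outputs:
            break  # timeout triggered: stop applying the stream as input
        outputs.append(node(element))
    return outputs


# An unbounded stream is cut off by the output-count limit.
preview = run_preview(itertools.count(), lambda x: x * 2, max_seconds=60.0, max_outputs=3)
print(preview)  # [0, 2, 4]
```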

FIG. 77 is a block diagram of one embodiment of a graphical programming system 7700 that provides a graphical interface for designing data processing pipelines, in accordance with example embodiments. As illustrated by FIG. 77, the graphical programming system 7700 can include an intake system 210, similar to that described above with reference to FIGS. 3A and 3B. In FIG. 77, the intake system 210 is depicted as having additional components that communicate with graphical user interface (“GUI”) pipeline creator 7720, including function repository 7712 and processing pipeline repository 7714. The function repository 7712 includes one or more physical storage devices that store data representing functions (e.g., a construct or command) that can be implemented by the streaming data processor 308 to manipulate information from an intake ingestion buffer 306, as described herein. The processing pipeline repository 7714 includes one or more physical storage devices that store data representing processing pipelines, for example processing pipelines created using the GUIs described herein. A processing pipeline representation stored by the processing pipeline repository 7714 can include an abstract syntax tree or AST, and each node of the AST can denote a construct or command occurring in the pipeline. An AST can be a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree can denote a construct occurring in the source code. Examples of AST-based processing are described in U.S. patent application Ser. No. 15/885,645, titled “DYNAMIC QUERY PROCESSOR FOR STREAMING AND BATCH QUERIES,” filed Jan. 31, 2018, the entirety of which is hereby incorporated by reference herein.

The GUI pipeline creator 7720 can manage the display of graphical interfaces as described herein, and can convert visual processing pipeline representations into ASTs for use by the intake system 210. The GUI pipeline creator 7720 can be implemented on one or more computing devices. For example, some implementations provide access to the GUI pipeline creator 7720 to client devices 204 remotely through network 208, and the GUI pipeline creator 7720 can be implemented on a server or cluster of servers. The GUI pipeline creator 7720 includes a number of modules including the display manager 7722, preview module 7724, recommendation module 7726, and pipeline publisher 7728. These modules can represent program instructions that configure one or more processor(s) to perform the described functions.

The display manager 7722 can generate instructions for rendering a graphical processing pipeline design interface, for example the interfaces depicted in the illustrative embodiments of the drawings. In one embodiment, the instructions include markup language, such as hypertext markup language (HTML). The display manager 7722 can send these instructions to a client device 204, which can in turn display the interface to a user and determine interactions with features of the user interface. For example, the display manager 7722 may transmit the instructions via hypertext transfer protocol (HTTP), and the client device 204 may execute a browser application to render the interface. The display manager 7722 can receive indications of the user interactions with the interface and update the instructions for rendering the interface accordingly. Further, the display manager 7722 can log the nodes and interconnections specified by the user for purposes of creating a computer-readable representation of the visually programmed processing pipeline designed via the interface.

The preview module 7724 can manage the display of previews of data flowing through the described processing pipelines. For example, the preview module 7724 can replace write functions with preview functions and add preview functions to other types of functions, where such preview functions capture a specified quantity of data output by particular nodes and also prevent deployment of an in-progress pipeline for writing to external systems. The preview module 7724 can communicate with the display manager 7722 to generate updates to the disclosed graphical interfaces that reflect the preview data.

The recommendation module 7726 can analyze various elements of data processing pipelines in order to recommend certain changes to users creating the pipelines. These changes can include, in various embodiments, entire pre-defined templates, filtered subsets of nodes compatible with upstream nodes, specific recommended nodes, and conditional branching recommendations. The recommendation module 7726 can implement machine learning techniques in some implementations in order to generate the recommendations, as described in further detail below. The recommendation module 7726 can access historical data for a particular user or a group of users in order to learn which recommendations to provide.

The pipeline publisher 7728 can convert a visual representation of a processing pipeline into a format suitable for deployment, for example an AST or a form of executable code. The pipeline publisher 7728 can perform this conversion at the instruction of a user (e.g., based on the user providing an indication that the pipeline is complete) in some implementations. The pipeline publisher 7728 can perform this conversion to partially deploy an in-progress pipeline in preview mode in some implementations.

FIG. 78 is an interface diagram of an example user interface 7800 for previewing a data processing pipeline 7810 being designed in the user interface 7800, in accordance with example embodiments. The depicted example processing pipeline 7810 corresponds to the first branch of a data processing pipeline.

In some implementations, the user interface 7800 can include a selectable feature 7820 that activates a preview mode. In other implementations, the preview mode can be activated each time the user specifies a new node or interconnection for the processing pipeline 7810. Activation of the preview mode can implement the in-progress pipeline on the intake system 210 in a manner that captures real information about node processing behavior without fully deploying the pipeline for writing to the specified data destinations (here, index1).

In order to semi-deploy the processing pipeline in this manner, activation of the preview mode, as described in further detail below, can transform the AST of the pipeline by adding functions that capture the messages published by the various nodes and prevent writing data to any external databases. This allows the preview to operate on live data streamed from the source(s) without affecting downstream systems, so that the user can determine what the processing pipeline is doing to actual data that flows through the system.

The preview mode can update the user interface 7800 with a preview region 7830. Alternatively, the preview region 7830 may be depicted in a tab of the user interface 7800 separate from a tab depicting the processing pipeline 7810 or the selectable feature 7820 that activates a preview mode. Similarly, the preview region 7830 can be depicted in the same window of the user interface 7800 or a different window of the user interface 7800 as the selectable feature 7820 that activates a preview mode. Initially, the preview region 7830 may be populated with a visual representation of data streaming from the source(s). A user can select an individual node (here depicted as anonymizer node 7811) in the user interface 7800 to preview the data output by that node. The visual representation of that node may be changed (e.g., with a border, highlighting, or other visual indication) to show which node is being previewed in the current interface.

The preview region 7830 can display a sampling of the different types of labels output by the node. In some embodiments, the sampling of label types that are displayed may be those that are outputted before a timeout occurs. The depicted example shows 6 labels, but this can change depending on the number of different types of labels that are present in the stream of raw machine data. A sampling of the labels output by node 7811 is displayed in the example user interface in region 7832, which here shows a label type followed by objects identified by deserialization (host device, data source, source type, data kind, and a body of the data) that correspond to the label type.

The anonymizer node 7811 may be designed to convert personally identifiable information into masked text. A user may be interested in determining whether the anonymizer node 7811 operates as designed or whether there are flaws in the design. For example, a flaw could be that social security numbers are not fully masked, telephone numbers are not masked properly, email addresses are not masked properly, and/or the like. As depicted in the region 7832, the first 4 label types may be “XXX-XX-XXXX,” “XXX-XX-XXXX3,” “XXXXXX@abc.com,” and “XXXXabcX@abc.com.” Because the label types lack any partially masked social security numbers, this may indicate that the anonymizer node 7811 masks social security numbers appropriately. However, the “XXX-XX-XXXX3” label type appears to indicate that the anonymizer node 7811 treats phone numbers as social security numbers, and therefore only masks the first 9 digits of phone numbers. Similarly, it appears that the anonymizer node 7811 properly masks email addresses when the email domain is not present before the “@” symbol, but does not properly mask email addresses when the email domain is present before the “@” symbol. On the other hand, if the region 7832 simply depicted the first N outputs, then it is possible that the user may not have come across one of the above-identified flaws in the design of the anonymizer node 7811 because the raw machine data that results in one of the flaws may not have been ingested at the time the user selected the preview mode and/or may not have been ingested until well after the preview mode had been run and ended.

The region 7832 can be populated with data captured by a preview function associated with the node 7811, and can be updated as the user selects different nodes in the processing pipeline 7810. The graphical interface can include selectable options to end the preview, or the user may end the preview by modifying or publishing the pipeline.

Although not illustrated in FIG. 78, the preview user interface 7800 may also include interactive features (e.g., input fields, a slidable feature on a timeline, etc.) that enable the user to specify time periods for preview mode. Many of the preview examples described herein relate to preview of real-time data flowing through a draft processing pipeline. However, in some scenarios this may not be desirable: as a user changes the pipeline, the user may want to see how these changes affect one fixed set of data, because if the data shown in the preview interface is ever-changing the user might have trouble locking in the processing flow. Thus, the preview user interface 7800 may have features that enable a user to input a time window that specifies what messages of each source should be processed. The intake ingestion buffer might maintain messages for a set period (e.g., 24 hours), and for some implementations of the preview mode a user may “go back in time” to process messages rather than process streaming data. The preview user interface 7800 may have features that allow the user to specify an end time to “replay” a stream of messages from the past.

For full deployment, a user might want to just deploy their processing pipeline for new (not yet processed) messages, or the user may also want to use the pipeline to process previous messages. For example, a user's current pipeline may have done something wrong. In order to fix it, the user can instruct the system to start again from 24 hours prior to recapture data that would otherwise be missed. In these instances, the older data may have already been processed using a previous pipeline. As such, the intake system 210 may tag data that is being reprocessed according to a new pipeline as potentially duplicative, such that a downstream system can understand that the data could be the same as data received based on a prior pipeline. The intake system 210 may tag the reprocessed data as authoritative, such that a downstream system can mark data from the same period but a different pipeline as deprecated.

Some implementations of the preview mode may also display performance metrics of each node, for example as a graphical representation displayed on the node or within the area of the node. Performance metrics including number of events flowing in and out of the node, quantity of bytes flowing in and out of the node, and latency-related values (e.g., p99 and average latency) can be displayed on the node. The preview interface can include the graphical representation of the pipeline, as in FIG. 78, with each node including a graphical representation of performance metric data.
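
For illustration, per-node metrics of the kind listed above could be accumulated as in the following sketch. The class and field names are hypothetical, and computing p99 with `statistics.quantiles` over all recorded latencies is an assumption made for brevity; a streaming system would more likely use an approximate quantile structure.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class NodeMetrics:
    """Hypothetical per-node counters for a preview display."""
    events_in: int = 0
    events_out: int = 0
    bytes_in: int = 0
    bytes_out: int = 0
    latencies_ms: list = field(default_factory=list)

    def observe(self, in_bytes, out_bytes, latency_ms):
        """Record one input event; out_bytes is None if the node emitted nothing."""
        self.events_in += 1
        self.bytes_in += in_bytes
        if out_bytes is not None:
            self.events_out += 1
            self.bytes_out += out_bytes
        self.latencies_ms.append(latency_ms)

    def summary(self):
        lat = self.latencies_ms
        # quantiles() needs at least two data points; index 98 is the 99th percentile.
        p99 = statistics.quantiles(lat, n=100, method="inclusive")[98] if len(lat) >= 2 else None
        avg = statistics.fmean(lat) if lat else None
        return {
            "events_in": self.events_in,
            "events_out": self.events_out,
            "bytes_in": self.bytes_in,
            "bytes_out": self.bytes_out,
            "avg_latency_ms": avg,
            "p99_latency_ms": p99,
        }


metrics = NodeMetrics()
for i in range(1, 101):
    # A filter-like node that drops every even-numbered event.
    metrics.observe(in_bytes=10, out_bytes=5 if i % 2 else None, latency_ms=float(i))
stats = metrics.summary()
print(stats["events_in"], stats["events_out"], stats["avg_latency_ms"])  # 100 50 50.5
```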

FIG. 79A is a block diagram of a graph representing a data processing pipeline 7900A, in accordance with example embodiments. The processing pipeline 7900A includes one source (read-source), two branches, and two destinations (write-stateless-indexer) with various transform nodes along the branches (filters, projection). A projection is a list of keys that selects resource data values. The data processing pipeline 7900A can be specified graphically by a user via the GUI pipeline creator 7720, as described herein.

FIG. 79B is a block diagram of the graph of FIG. 79A having added nodes to facilitate the disclosed data processing pipeline previews, in accordance with example embodiments. These preview nodes are illustrated by the dashed line nodes labeled with “limit+preview”. In response to activation of the preview mode, the preview module 7724 can analyze the nodes of the specified pipeline and perform a rewrite pass on the AST of the pipeline. In some implementations, the rewrite functionality of the preview module 7724 can be implemented on the backend intake system 210. During the rewrite pass, the preview module 7724 can replace any sink or write functions (e.g., functions that write data to external systems) with a function that drops the data. This is because when a user runs a preview the user may not want to index data because the user may still be developing a draft pipeline, and as such it would be undesirable to affect long-term storage systems with data from a draft pipeline. This is shown in FIG. 79B by replacing the “write-stateless-indexer” functions with the “write-null” functions. Also, during the rewrite pass, the preview module 7724 can (for every other function in the graph) add an additional function that performs the limit+preview function. This can pull a specified quantity of data published to the topic of that node to show the user a preview of this data. The limit can be enforced with the goal of not overwhelming the user with too large a quantity of streaming data.

As shown in FIG. 79B, these preview nodes can be added in new branches to preserve the original interconnections between nodes. As such, the end result after the rewrite pass is the initial graph plus additional branches that lead to the functions responsible for handling previews of data. The preview mode can then include running a preview job as a regular job after the rewrite step. Due to the newly added preview nodes, when data leaves the nodes specified by the user, the data is sent along the new branches to the preview functions, which can sample the data. The preview functions can be configured with an upstream identifier function so that sampled data displayed during the preview mode can be annotated with its source. The preview functions can push captured records back to the GUI pipeline creator 7720, for example by a REST endpoint, for storage in a memory that can be accessed during the preview mode. The GUI pipeline creator 7720 can then pull the data, for example by another REST endpoint, for records that have been previewed. As a result, for the end user it can appear as if data or a sampling thereof is flowing into the user interface from the source(s).
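
The rewrite pass can be sketched over a simplified graph representation as follows. This is an illustrative sketch, not the AST format used by the intake system 210: the dictionary-and-edge-list representation, the node identifiers, and the rule that any function name beginning with "write-" is a write function are all assumptions.

```python
def rewrite_for_preview(nodes, edges, write_prefix="write-"):
    """Return a rewritten (nodes, edges) pair for preview mode.

    nodes: dict mapping node id -> function name.
    edges: list of (source id, destination id) pairs.
    Write functions are replaced with "write-null"; every other node gets a
    new branch leading to a "limit+preview" node. Original edges are kept.
    """
    new_nodes = {}
    new_edges = list(edges)  # preserve the original interconnections
    for node_id, function in nodes.items():
        if function.startswith(write_prefix):
            new_nodes[node_id] = "write-null"  # drop data instead of writing it
        else:
            new_nodes[node_id] = function
            preview_id = f"{node_id}:preview"
            new_nodes[preview_id] = "limit+preview"
            new_edges.append((node_id, preview_id))  # new branch for the preview
    return new_nodes, new_edges


# A graph shaped like FIG. 79A: one source, two branches, two destinations.
nodes = {
    "src": "read-source",
    "f1": "filter",
    "p1": "projection",
    "w1": "write-stateless-indexer",
    "f2": "filter",
    "w2": "write-stateless-indexer",
}
edges = [("src", "f1"), ("f1", "p1"), ("p1", "w1"), ("src", "f2"), ("f2", "w2")]
preview_nodes, preview_edges = rewrite_for_preview(nodes, edges)
print(preview_nodes["w1"], len(preview_edges))  # write-null 9
```

The rewritten graph keeps all five original edges and adds four preview branches, one for each node that is not a write function.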

FIG. 80 is a flow diagram depicting illustrative interactions forgenerating data processing pipeline previews, in accordance with exampleembodiments. The interactions 8000 occur between a client device 204,the GUI pipeline creator 7720, and the intake system 210.

At (1), the client device 204 sends a request to activate the previewmode to the frontend GUI pipeline creator 7720. In response, at (2) theGUI pipeline creator 7720 sends the AST of the currently specifiedprocessing pipeline to the backend intake system 210.

At (3), the intake system 210 can perform the rewrite processing described above that causes any functions that write to external databases to drop their data rather than write it to the external database, and that adds new branches with preview nodes for capturing data output by the individual nodes of the processing pipeline. It will be appreciated that other implementations may perform the rewrite processing at the GUI pipeline creator 7720. The rewrite step can produce an augmented AST including additional branches and preview nodes, as described above with respect to FIG. 79B.

At (4), the intake system can run a job using the augmented AST. While this job is running on live data streamed from the specified source(s), when data leaves the nodes specified by the user it is sent along the new branches to the preview functions. At (5), the preview functions capture records, such as labels produced by the nodes specified by the user. The occurrence of labels produced by the nodes specified by the user may vary widely by label type, with some label types occurring often and other label types occurring less often. At (6), the preview functions can sample the captured records and push a sampling of the captured records back to the GUI pipeline creator 7720, for example by a REST endpoint, for storage in a memory that can be accessed during the preview mode. For example, while some label types may occur often and other label types less often, the sampling may be of an equal number of each label type. Thus, the sampling of the captured records may include the same or similar number of each type of label produced by the nodes specified by the user regardless of the actual frequency of the label types. In some implementations, the preview nodes can also capture metrics such as processing resources and processing time of individual nodes. At (7), the GUI pipeline creator 7720 can then poll another REST endpoint for a sampling of the records that have been captured in order to generate the preview GUI.
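As one illustrative, non-limiting sketch of the equal-per-label sampling described above, captured records could be grouped by label type and the same number drawn from each group. The function name sample_equal_per_label, the per_label parameter, and the "label" record key are assumptions made for this sketch, and the records are assumed to have already been captured into a list rather than sampled directly from a stream.

```python
import random
from collections import defaultdict


def sample_equal_per_label(records, per_label, seed=None):
    """Draw up to `per_label` records of each label type, regardless of
    how often each label type occurs in `records`."""
    rng = random.Random(seed)

    # Group the captured records by their label type.
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    sample = []
    for group in by_label.values():
        # A rare label type may have fewer than `per_label` records.
        k = min(per_label, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```

With this approach a label type that makes up a small fraction of the captured records is still represented in the preview by the same number of records as a frequent label type, provided enough records of that type were captured.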

At (8), the GUI pipeline creator 7720 can send the preview GUI to the client device 204. Some implementations may display a single preview interface that depicts a sampling of data captured from each node in the pipeline. In other implementations, the preview mode can be configured to display a sampling of the data of a single node at a time, for example to present a more compact visual preview, and thus at (9) the user may select a particular node for which they would like to preview data. At (10) the client device 204 sends an indication of the selected node to the GUI pipeline creator 7720, which at (11) can poll a REST endpoint for a sampling of records (or pull a sampling of records from the REST endpoint) that have been captured from the selected node. At (12), the GUI pipeline creator 7720 can send the updated preview GUI to the client device 204. Interactions (9) through (12) may be repeated a number of times as the user previews a sampling of data output by some or all nodes in the pipeline.
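The per-node storage that supports interactions (9) through (12) could, as a minimal sketch, be modeled as an in-memory buffer keyed by the upstream node identifier. The PreviewBuffer class and its push and pull methods are hypothetical stand-ins for the REST endpoints, and retaining only the most recent records up to the limit is an assumption of this sketch rather than a stated behavior.

```python
from collections import deque


class PreviewBuffer:
    """Hypothetical in-memory store for previewed records, keyed by node."""

    def __init__(self, limit):
        self._limit = limit
        self._records = {}  # upstream node id -> bounded deque of records

    def push(self, upstream_id, record):
        """Called on behalf of a preview function to store a captured record."""
        buf = self._records.setdefault(upstream_id, deque(maxlen=self._limit))
        # Annotate the record with its source node before storing it.
        buf.append({"upstream": upstream_id, "record": record})

    def pull(self, upstream_id):
        """Called on behalf of the GUI to fetch records for a selected node."""
        return list(self._records.get(upstream_id, ()))
```

Selecting a different node in the GUI then corresponds to calling pull with a different upstream identifier, which does not disturb the records held for other nodes.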

With reference to FIG. 81, an illustrative algorithm or routine 8100 implemented by the graphical programming system 7700 to generate data processing pipeline previews will be described in the form of a flowchart. The routine 8100 begins at block 8102, where the GUI pipeline creator 7720 provides a GUI through which a user can program operation of a data processing pipeline by specifying a graph or tree of nodes that transform data, as well as interconnections that designate routing of data between individual nodes within the graph. This GUI can include the user interface 7800, node addition options, and/or the preview/recommendation features described herein.

At block 8104, the GUI pipeline creator 7720 receives specification of the graph of nodes and interconnections, for example from a client device that displays the GUI. The nodes can include one or more data sources that send data along the interconnections to one or more data destinations, optionally with transform nodes disposed between the source(s) and destination(s). This specified pipeline may be a draft or in-progress pipeline that the user has currently configured using the visual interface, rather than a finalized pipeline that is ready for deployment on the intake system 210.

At block 8106, the GUI pipeline creator 7720 can activate a preview mode that causes the data processing pipeline to retrieve data from at least one source specified by the graph, transform the data according to the nodes of the graph, sample the transformed data, and display the sampling of the transformed data of at least one node without writing the transformed data (or the sampling thereof) to at least one destination specified by the graph. As described above, this can involve rewriting an AST representing the draft pipeline to replace sync functions and add preview functions to all other nodes, which may be performed by the GUI pipeline creator 7720 or the intake system 210. The intake system 210 can then use this augmented AST to run a job that pulls data streaming from the specified source(s) into the pipeline and captures records of data output by each node using the preview functions. The preview functions can sample the data output by each node, and the GUI pipeline creator 7720 can then pull the sampling of these captured records to populate the preview interface, giving the impression to the user that live streaming data is flowing into the interface from the source, while preventing the writing of data to external storage systems.

Fewer, more, or different blocks can be used as part of the routine 8100. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 81 can be implemented in a variety of orders, or can be performed concurrently.

4.16.7. A/B Testing and Algorithm Swapping

As described herein, a user can design a data processing pipeline. In some cases, the data processing pipeline can include a machine learning model as one component in the data processing pipeline. The machine learning model may be trained and/or re-trained using a first type of machine learning algorithm. However, another type of machine learning algorithm may be later developed that improves upon the first type of machine learning algorithm. Typically, if the user desires to swap the first type of machine learning algorithm with the improved type of machine learning algorithm, such a swap may involve re-training the machine learning model using all of the raw machine data previously ingested in the data processing pipeline and the improved type of machine learning algorithm. Performing this re-training can be computing resource intensive and cause delays in downstream nodes or components of the data processing pipeline.

At least one reason why swapping machine learning algorithms in the data processing pipeline may cause the re-training to occur is that machine learning algorithms and machine learning model state (e.g., weights, parameters, hyperparameters, etc. of a machine learning model) are typically tied together. For example, the machine learning algorithm code may include both transformation operations and variables defining the model state. If the machine learning algorithm code were to be replaced with new code, then the model state would be lost (e.g., because the variables defining the model state would be erased or overwritten), thereby resulting in a new machine learning model having to be trained.

It can also be difficult to determine whether an existing machine learning algorithm should be replaced with a different machine learning algorithm. For example, because the existing machine learning algorithm may be operating on a live stream of raw machine data as the stream is ingested, resulting in data being written to external storage systems, it may not be practical to test one or more machine learning algorithms using the live stream of raw machine data in real-time. Rather, the one or more machine learning algorithms may be tested using the live stream of raw machine data at some later time after the live stream has been transformed and written to external storage systems. This delay in testing, however, can prevent improved machine learning algorithms from being deployed sooner.

Accordingly, a machine learning model testing and swapping system is described herein in which machine learning algorithms and model states are separated. For example, various machine learning algorithms may be stored in the streaming data processor(s) 308. The model state (or variables defining the model state), however, may be stored in an external location, such as in the processing pipeline repository 7714, in a separate location within the streaming data processor(s) 308, or in another data store of the intake system 210. The machine learning algorithm code may be designed to include transformation operations and references to the storage location of the model state rather than variables defining the model state. In this way, swapping machine learning algorithms may not involve re-training a machine learning model using all of the raw machine data previously ingested in the data processing pipeline and the swapped machine learning algorithm. Rather, because the model state is stored external to the machine learning algorithm code, the machine learning algorithm code referencing the model state storage location can be swapped with another machine learning algorithm code referencing the model state storage location. In other words, the transformation operations that define the machine learning algorithm may change, but the model state may not be lost or deleted during the swap because the model state is stored externally and can simply be retrieved by the new machine learning algorithm from the external storage location.
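The separation of algorithm code from model state described above can be illustrated with the following minimal sketch. The StateStore, RunningMean, and RecencyWeightedMean classes, the "model-state" key, and the choice of a running mean as the model are all hypothetical and are used only to show that an algorithm holding a reference to externally stored state can be replaced without losing that state.

```python
class StateStore:
    """Hypothetical external store for model state, keyed by storage location."""

    def __init__(self):
        self._states = {}

    def get(self, key):
        return dict(self._states[key])

    def put(self, key, state):
        self._states[key] = dict(state)


class RunningMean:
    """First algorithm: holds a reference (store, key), not the state itself."""

    def __init__(self, store, key):
        self.store, self.key = store, key

    def train(self, x):
        state = self.store.get(self.key)
        state["n"] += 1
        state["mean"] += (x - state["mean"]) / state["n"]
        self.store.put(self.key, state)


class RecencyWeightedMean:
    """Second algorithm: different transformation, same state reference."""

    def __init__(self, store, key, alpha=0.5):
        self.store, self.key, self.alpha = store, key, alpha

    def train(self, x):
        state = self.store.get(self.key)
        state["n"] += 1
        state["mean"] = self.alpha * x + (1 - self.alpha) * state["mean"]
        self.store.put(self.key, state)


store = StateStore()
store.put("model-state", {"n": 0, "mean": 0.0})

algorithm = RunningMean(store, "model-state")
algorithm.train(2.0)
algorithm.train(4.0)

# Swap the algorithm code; the state in the store is untouched by the swap.
algorithm = RecencyWeightedMean(store, "model-state")
algorithm.train(5.0)
```

Because neither class keeps the model state in its own variables, discarding the first algorithm object does not discard what it learned, and the second algorithm continues from the stored state rather than from an untrained model.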

In addition, the machine learning model testing and swapping system described herein allows for any number of machine learning algorithms to be tested in parallel with an existing machine learning algorithm (e.g., A/B testing). For example, a user can design a data processing pipeline in a manner as described herein in which an existing machine learning model trained by an existing machine learning algorithm is implemented by a node or component in the data processing pipeline, with the existing machine learning model operating on a live stream of raw machine data and having its output eventually written to external storage systems. The design can further include one or more machine learning models trained by one or more machine learning algorithms being tested also operating on the live stream of raw machine data. The test machine learning model(s), however, may be implemented by node(s) in branches of the data processing pipeline that do not end with any data being written to external storage systems. Thus, an existing machine learning algorithm and one or more test machine learning algorithms can be run in parallel on the same data. The outputs of the models trained by these machine learning algorithms can then be compared to determine which model produces the most accurate results. If a machine learning algorithm being tested turns out to be more accurate than the existing machine learning algorithm, then the algorithms can be swapped without any downtime or delay in the data processing pipeline and without losing the model state.

FIG. 82 is a block diagram of a graph representing a data processing pipeline 8200, in accordance with example embodiments. As illustrated in FIG. 82, the data processing pipeline 8200 includes a read-source from which a stream of raw machine data originates. The stream of raw machine data may eventually pass through to machine learning model 8202, which is trained and/or re-trained by the machine learning algorithm 8212. The stream of raw machine data may also pass through to machine learning model 8204, which is trained and/or re-trained by the machine learning algorithm 8214.

In some embodiments, the machine learning model 8202 and not the machine learning model 8204 was originally present in the data processing pipeline 8200. The user, however, may have modified the data processing pipeline 8200 using the techniques and/or user interface described above to test the machine learning algorithm 8214 to see if the machine learning algorithm 8214 is better than the machine learning algorithm 8212. As a result, an output of the machine learning model 8202 eventually passes through to external storage systems, such as destination data store 8206. The machine learning model 8204, however, is positioned within a branch of the data processing pipeline 8200 that does not result in any writes to external storage systems. Thus, the machine learning algorithm 8214 can be tested without any outputs of the machine learning model 8204 accidentally being stored in an external storage system.

As described herein, the machine learning algorithms 8212 and 8214 may not store the model state (e.g., model parameters) internally. Rather, the model state may be stored in the processing pipeline repository 7714 or another data store. Thus, the machine learning algorithms 8212 and 8214 may communicate with the processing pipeline repository 7714 to obtain model state information, and use the stream of raw machine data and/or the model state information to train and/or re-train the machine learning models 8202 and 8204, respectively.

FIG. 83 is another block diagram of a graph representing the data processing pipeline 8200, in accordance with example embodiments. As illustrated in FIG. 83, the machine learning algorithm swapper 6012 can test the performance of the machine learning algorithms 8212 and 8214 and optionally swap the existing machine learning algorithm 8212 with the test machine learning algorithm 8214 if the test machine learning algorithm 8214 has better performance.

For example, the machine learning algorithms 8212 and 8214 can be tested in parallel for a finite period of time, until each has produced a certain number of outputs, until each has taken a certain number of raw machine data elements as inputs, and/or the like. Once the testing period is complete, the machine learning algorithm swapper 6012 can evaluate the performance. For example, the machine learning algorithm swapper 6012 may be positioned in a branch of the data processing pipeline 8200 and can receive output 8302 from the machine learning model 8202 and output 8304 from the machine learning model 8204. The outputs 8302 and 8304 may be produced as a result of a particular raw machine data element being ingested and provided to the machine learning models 8202 and 8204, respectively, as an input.

Separately, the machine learning algorithm swapper 6012 can obtain a label 8312 that may represent an actual value resulting from the raw machine data element being ingested. Thus, the machine learning algorithm swapper 6012 can use the label 8312 to determine which output 8302 or 8304 is closer to the actual value (e.g., label 8312). In other words, the machine learning algorithm swapper 6012 can use the label 8312 to determine which machine learning model 8202 or 8204 has a lower loss (e.g., a smaller difference between the prediction and actual values). If the output 8304 is closer to the actual value (e.g., the machine learning model 8204 is more accurate, has a lower loss, etc.), the machine learning algorithm swapper 6012 may swap the machine learning algorithm 8212 with the machine learning algorithm 8214 given that the machine learning algorithm 8214 produces more accurate models than the machine learning algorithm 8212. The swap may include the machine learning algorithm swapper 6012 replacing the machine learning algorithm 8212 code with the machine learning algorithm 8214 code, replacing the transformation operations included in the machine learning algorithm 8212 code with the transformation operations included in the machine learning algorithm 8214 code (but not replacing the reference in the machine learning algorithm 8212 code to the storage location of the model state of the machine learning model 8202), and/or the like. The machine learning algorithm swapper 6012 can perform the swap in real-time, without any data processing pipeline 8200 downtime. Once swapped, the machine learning algorithm 8214 may begin re-training the latest version of the machine learning model 8202. Alternatively, the machine learning model 8204 may also be swapped in place of the machine learning model 8202, and the machine learning algorithm 8214 may begin re-training the latest version of the machine learning model 8204.

In other embodiments, the machine learning model 8202 may operate in a production stack (or active environment) and the machine learning model 8204 may operate in a test stack (or background environment). Swapping the two models may include the machine learning algorithm swapper 6012 swapping the machine learning model 8202 for the machine learning model 8204 in the production stack.

In further embodiments, the machine learning algorithm swapper 6012 compares multiple outputs generated by the machine learning models 8202 and 8204 to determine which algorithm is performing better. Thus, the machine learning algorithm swapper 6012 may obtain multiple labels 8312 in order to evaluate the performance (e.g., accuracy) of the algorithms 8212 and 8214.
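As a non-limiting sketch of such a comparison over multiple outputs, the loss of each model can be averaged over the labeled data elements and the algorithm with the lower mean loss selected. The function names, the use of mean absolute difference as the loss, and the rule that a tie keeps the existing algorithm are assumptions made for illustration.

```python
def mean_absolute_loss(outputs, labels):
    """Average absolute difference between model outputs and actual values."""
    if not labels or len(outputs) != len(labels):
        raise ValueError("outputs and labels must be non-empty and equal in length")
    return sum(abs(o - y) for o, y in zip(outputs, labels)) / len(labels)


def choose_algorithm(existing_outputs, test_outputs, labels):
    """Return "test" if the test model has the lower mean loss, else "existing"."""
    existing_loss = mean_absolute_loss(existing_outputs, labels)
    test_loss = mean_absolute_loss(test_outputs, labels)
    # On a tie the existing algorithm is kept, so a swap happens only
    # when the test algorithm is strictly better on this data.
    return "test" if test_loss < existing_loss else "existing"
```

Here existing_outputs and test_outputs play the role of the outputs 8302 and 8304 collected over several raw machine data elements, and labels plays the role of the corresponding labels 8312.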

While FIGS. 82-83 depict one machine learning algorithm 8214 being tested, this is not meant to be limiting. Any number of machine learning algorithms can be tested in parallel with an existing machine learning algorithm 8212.

FIG. 84 is a flow diagram illustrative of an embodiment of a routine 8400 implemented by the streaming data processor 308 to test and swap machine learning algorithms. Although described as being implemented by the streaming data processor 308, it will be understood that the elements outlined for routine 8400 can be implemented by one or more computing devices/components that are associated with the intake system 210, such as, but not limited to, the machine learning algorithm swapper 6012. Thus, the following illustrative embodiment should not be construed as limiting.

At block 8402, a first version of a model is generated using raw machine data, a first machine learning algorithm, and a trained model for processing raw machine data obtained from an event data stream. For example, the first version of the model may produce outputs that may be transformed zero or more times and written to external storage systems. As another example, the first version of the model may be implemented within a production stack operating on live data.

At block 8404, a second version of the model is generated using the raw machine data, a second machine learning algorithm, and the trained model. For example, the second version of the model may produce outputs that are not transformed or written to external storage systems. Rather, the second version of the model may be present in a branch of the data processing pipeline that does not result in data being written to external storage systems. As another example, the second version of the model may be implemented within a test stack separate from a production stack. The second machine learning algorithm may be under test by a user, and the second machine learning algorithm may start with the model trained by the first machine learning algorithm as a starting point before re-training occurs (e.g., using the raw machine data). The first and second versions of the model may be generated in parallel. Thus, A/B testing may be performed in which the second version of the model is tested (e.g., in a test stack, in a background environment, etc.) while the first version of the model is in production (e.g., in a production stack, in an active environment in which transforms are performed on live data, etc.).

At block 8406, an accuracy of the first version of the model is compared with an accuracy of the second version of the model on a particular set of data. For example, each model may receive individual data from the set as inputs over time and produce corresponding outputs. The produced outputs can then be compared with the actual or expected outputs to determine which model produced more accurate outputs.

The machine learning algorithm swapper 6012 may determine, some time period after the second version of the model is generated, whether to continue writing transformed data based on the first version of the model to the external storage systems or whether to begin writing transformed data based on the second version of the model (or other versions of the model being tested) to the external storage systems instead. Once the machine learning algorithm swapper 6012 determines that it is time to decide which transformed data to write to the external storage systems going forward, then the machine learning algorithm swapper 6012 may begin to compare the accuracy of the models and/or algorithms.

At block 8408, the second version of the model is determined to be more accurate than the first version of the model. For example, the outputs of the second version of the model may have been closer to the actual or expected outputs than the outputs of the first version of the model.

At block 8410, subsequent raw machine data obtained from the event data stream is processed using the second version of the model. For example, the first machine learning algorithm may be replaced with the second machine learning algorithm such that the second machine learning algorithm will be used to train models that produce output written to external storage systems going forward. The second machine learning algorithm may have trained the second version of the model during the testing phase, and can start using the second version of the model on a live stream of raw machine data. In particular, outputs of the second version of the model may now be transformed zero or more times and written to external storage systems. Alternatively, the first version of the model may continue to be used to transform the live stream of raw machine data, but the second machine learning algorithm (and not the first machine learning algorithm) may begin to re-train the first version of the model going forward. For example, the transformation operations included in the first machine learning algorithm code may be swapped with the transformation operations included in the second machine learning algorithm code. Thus, the transformation operations may be updated, but code may still reference a storage location of the parameters of the first version of the model.

In further embodiments, the second machine learning algorithm may be designed such that the algorithm weights more-recent raw machine data more than less-recent raw machine data. Thus, the weighting may result in the improvements of the second machine learning algorithm more quickly refining the model parameters of the machine learning model being trained.

Fewer, more, or different blocks can be used as part of the routine 8400. In some cases, one or more blocks can be omitted. Furthermore, it will be understood that the various blocks described herein with reference to FIG. 84 can be implemented in a variety of orders, or can be performed concurrently. For example, the second version of the model can be generated before the first version of the model.

4.17. Other Architectures

In view of the description above, it will be appreciated that the architecture disclosed herein, or elements of that architecture, may be implemented independently from, or in conjunction with, other architectures. For example, the Parent Applications disclose a variety of architectures wholly or partially compatible with the architecture of the present disclosure.

Generally speaking, one or more components of the data intake and query system 108 of the present disclosure can be used in combination with or to replace one or more components of the data intake and query system 108 of the Parent Applications. For example, depending on the embodiment, the operations of the forwarder 204 and the ingestion buffer 4802 of the Parent Applications can be performed by or replaced with the intake system 210 of the present disclosure. The parsing, indexing, and storing operations (or other non-searching operations) of the indexers 206, 230 and indexing cache components 254 of the Parent Applications can be performed by or replaced with the indexing nodes 404 of the present disclosure. The storage operations of the data stores 208 of the Parent Applications can be performed using the data stores 412 of the present disclosure (in some cases with the data not being moved to common storage 216). The storage operations of the common storage 4602, cloud storage 256, or global index 258 can be performed by the common storage 216 of the present disclosure. The storage operations of the query acceleration data store 3308 can be performed by the query acceleration data store 222 of the present disclosure.

As continuing examples, the search operations of the indexers 206, 230 and indexing cache components 254 of the Parent Applications can be performed by or replaced with the indexing nodes 404 in some embodiments or by the search nodes 506 in certain embodiments. For example, in some embodiments of certain architectures of the Parent Applications (e.g., one or more embodiments related to FIGS. 2, 3, 4, 18, 25, 27, 33, 46), the indexers 206, 230 and indexing cache components 254 of the Parent Applications may perform parsing, indexing, storing, and at least some searching operations, and in embodiments of some architectures of the Parent Applications (e.g., one or more embodiments related to FIG. 48), indexers 206, 230 and indexing cache components 254 of the Parent Applications perform parsing, indexing, and storing operations, but do not perform searching operations. Accordingly, in some embodiments, some or all of the searching operations described as being performed by the indexers 206, 230 and indexing cache components 254 of the Parent Applications can be performed by the search nodes 506. For example, in embodiments described in the Parent Applications in which worker nodes 214, 236, 246, 3306 perform searching operations in place of the indexers 206, 230 or indexing cache components 254, the search nodes 506 can perform those operations. In certain embodiments, some or all of the searching operations described as being performed by the indexers 206, 230 and indexing cache components 254 of the Parent Applications can be performed by the indexing nodes 404. For example, in embodiments described in the Parent Applications in which the indexers 206, 230 and indexing cache components 254 perform searching operations, the indexing nodes 404 can perform those operations.

As a further example, the query operations performed by the search heads 210, 226, 244, daemons 210, 232, 252, search master 212, 234, 250, search process master 3302, search service provider 216, and query coordinator 3304 of the Parent Applications, can be performed by or replaced with any one or any combination of the query system manager 502, search head 504, search master 512, search manager 514, search node monitor 508, and/or the search node catalog 510. For example, these components can handle and coordinate the intake of queries, query processing, identification of available nodes and resources, resource allocation, query execution plan generation, assignment of query operations, combining query results, and providing query results to a user or a data store.

In certain embodiments, the query operations performed by the worker nodes 214, 236, 246, 3306 of the Parent Applications can be performed by or replaced with the search nodes 506 of the present disclosure. In some embodiments, the intake or ingestion operations performed by the worker nodes 214, 236, 246, 3306 of the Parent Applications can be performed by or replaced with one or more components of the intake system 210.

Furthermore, it will be understood that some or all of the components of the architectures of the Parent Applications can be replaced with components of the present disclosure. For example, in certain embodiments, the intake system 210 can be used in place of the forwarders 204 and/or ingestion buffer 4802 of one or more architectures of the Parent Applications, with all other components of the one or more architectures of the Parent Applications remaining the same. As another example, in some embodiments the indexing nodes 404 can replace the indexer 206 of one or more architectures of the Parent Applications with all other components of the one or more architectures of the Parent Applications remaining the same. Accordingly, it will be understood that a variety of architectures can be designed using one or more components of the data intake and query system 108 of the present disclosure in combination with one or more components of the data intake and query system 108 of the Parent Applications.

Illustratively, the architecture depicted at FIG. 2 of the Parent Applications may be modified to replace the forwarder 204 of that architecture with the intake system 210 of the present disclosure. In addition, in some cases, the indexers 206 of the Parent Applications can be replaced with the indexing nodes 404 of the present disclosure. In such embodiments, the indexing nodes 404 can retain the buckets in the data stores 412 that they create rather than store the buckets in common storage 216. Further, in the architecture depicted at FIG. 2 of the Parent Applications, the indexing nodes 404 of the present disclosure can be used to execute searches on the buckets stored in the data stores 412. In some embodiments, in the architecture depicted at FIG. 2 of the Parent Applications, the partition manager 408 can receive data from one or more forwarders 204 of the Parent Applications. As additional forwarders 204 are added or as additional data is supplied to the architecture depicted at FIG. 2 of the Parent Applications, the indexing node 406 can spawn additional partition managers 408 and/or the indexing manager system 402 can spawn additional indexing nodes 404. In addition, in certain embodiments, the bucket manager 414 may merge buckets in the data store 414 or be omitted from the architecture depicted at FIG. 2 of the Parent Applications.

Furthermore, in certain embodiments, the search head 210 of the Parent Applications can be replaced with the search head 504 of the present disclosure. In some cases, as described herein, the search head 504 can use the search master 512 and search manager 514 to process and manage the queries. However, rather than communicating with search nodes 506 to execute a query, the search head 504 can, depending on the embodiment, communicate with the indexers 206 of the Parent Applications or the search nodes 404 to execute the query.

Similarly, the architecture of FIG. 3 of the Parent Applications may be modified in a variety of ways to include one or more components of the data intake and query system 108 described herein. For example, the architecture of FIG. 3 of the Parent Applications may be modified to include an intake system 210 in accordance with the present disclosure within the cloud-based data intake and query system 1006 of the Parent Applications, which intake system 210 may logically include or communicate with the forwarders 204 of the Parent Applications. In addition, the indexing nodes 404 described herein may be utilized in place of or to implement functionality similar to the indexers described with reference to FIG. 3 of the Parent Applications. In addition, the architecture of FIG. 3 of the Parent Applications may be modified to include common storage 216 and/or search nodes 506.

With respect to the architecture of FIG. 4 of the Parent Applications, the intake system 210 described herein may be utilized in place of or to implement functionality similar to either or both the forwarders 204 or the ERP processes 410 through 412 of the Parent Applications. Similarly, the indexing nodes 506 and the search head 504 described herein may be utilized in place of or to implement functionality similar to the indexer 206 and search head 210, respectively. In some cases, the search manager 514 described herein can manage the communications and interfacing between the indexer 210 and the ERP processes 410 through 412.

With respect to the flow diagrams and functionality described in FIGS. 5A-5C, 6A, 6B, 7A-7D, 8A, 8B, 9, 10, 11A-11D, 12-16, and 17A-17D of the Parent Applications, it will be understood that the processing and indexing operations described as being performed by the indexers 206 can be performed by the indexing nodes 404, the search operations described as being performed by the indexers 206 can be performed by the indexing nodes 404 or search nodes 506 (depending on the embodiment), and/or the searching operations described as being performed by the search head 210 can be performed by the search head 504 or other component of the query system 214.

With reference to FIG. 18 of the Parent Applications, the indexing nodes 404 and search heads 504 described herein may be utilized in place of or to implement functionality similar to the indexers 206 and search head 210, respectively. Similarly, the search master 512 and search manager 514 described herein may be utilized in place of or to implement functionality similar to the master 212 and the search service provider 216, respectively, described with respect to FIG. 18 of the Parent Applications. Further, the intake system 210 described herein may be utilized in place of or to implement ingestion functionality similar to the ingestion functionality of the worker nodes 214 of the Parent Applications. Similarly, the search nodes 506 described herein may be utilized in place of or to implement search functionality similar to the search functionality of the worker nodes 214 of the Parent Applications.

With reference to FIG. 25 of the Parent Applications, the indexing nodes 404 and search heads 504 described herein may be utilized in place of or to implement functionality similar to the indexers 236 and search heads 226, respectively. In addition, the search head 504 described herein may be utilized in place of or to implement functionality similar to the daemon 232 and the master 234 described with respect to FIG. 25 of the Parent Applications. The intake system 210 described herein may be utilized in place of or to implement ingestion functionality similar to the ingestion functionality of the worker nodes 214 of the Parent Applications. Similarly, the search nodes 506 described herein may be utilized in place of or to implement search functionality similar to the search functionality of the worker nodes 234 of the Parent Applications.

With reference to FIG. 27 of the Parent Applications, the indexing nodes 404 or search nodes 506 described herein may be utilized in place of or to implement functionality similar to the index cache components 254. For example, the indexing nodes 404 may be utilized in place of or to implement parsing, indexing, and storing functionality of the index cache components 254, and the search nodes 506 described herein may be utilized in place of or to implement searching or caching functionality similar to the index cache components 254. In addition, the search head 504 described herein may be utilized in place of or to implement functionality similar to the search heads 244, daemon 252, and/or the master 250 described with respect to FIG. 27 of the Parent Applications. The intake system 210 described herein may be utilized in place of or to implement ingestion functionality similar to the ingestion functionality of the worker nodes 246 described with respect to FIG. 27 of the Parent Applications. Similarly, the search nodes 506 described herein may be utilized in place of or to implement search functionality similar to the search functionality of the worker nodes 234 described with respect to FIG. 27 of the Parent Applications. In addition, the common storage 216 described herein may be utilized in place of or to implement functionality similar to the functionality of the cloud storage 256 and/or global index 258 described with respect to FIG. 27 of the Parent Applications.

With respect to the architectures of FIGS. 33, 46, and 48 of the Parent Applications, the intake system 210 described herein may be utilized in place of or to implement functionality similar to the forwarders 204. In addition, the indexing nodes 404 of the present disclosure can perform the functions described as being performed by the indexers 206 (e.g., parsing, indexing, storing, and in some embodiments, searching) of the architectures of FIGS. 33, 46, and 48 of the Parent Applications; the operations of the acceleration data store 3308 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the acceleration data store 222 of the present application; and the operations of the search head 210, search process master 3302, and query coordinator 3304 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the search head 504, search node catalog 510, and/or search node monitor 508 of the present application. For example, the functionality of the workload catalog 3312 and node monitor 3314 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the search node catalog 510 and search node monitor 508; the functionality of the search head 210 and other components of the search process master 3302 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the search head 504 or search master 512; and the functionality of the query coordinator 3304 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the search manager 514.

In addition, in some embodiments, the searching operations described as being performed by the worker nodes 3306 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the search nodes 506 of the present application and the intake or ingestion operations performed by the worker nodes 3306 of the architectures of FIGS. 33, 46, and 48 of the Parent Applications can be performed by the intake system 210. However, it will be understood that in some embodiments, the search nodes 506 can perform the intake and search operations described in the Parent Applications as being performed by the worker nodes 3306. Furthermore, the cache manager 516 can implement one or more of the caching operations described in the Parent Applications with reference to the architectures of FIGS. 33, 46, and 48 of the Parent Applications.

With respect to FIGS. 46 and 48 of the Parent Applications, the common storage 216 of the present application can be used to provide the functionality with respect to the common storage 2602 of the architecture of FIGS. 46 and 48 of the Parent Applications. With respect to the architecture of FIG. 48 of the Parent Applications, the intake system 210 described herein may be utilized in place of or to implement operations similar to the forwarders 204 and ingested data buffer 4802, and may in some instances implement all or a portion of the operations described in that reference with respect to worker nodes 3306. Thus, the architecture of the present disclosure, or components thereof, may be implemented independently from or incorporated within architectures of the prior disclosures.

5.0 Terminology

Computer programs typically comprise one or more instructions set at various times in various memory devices of a computing device, which, when read and executed by at least one processor, will cause a computing device to execute functions involving the disclosed techniques. In some embodiments, a carrier containing the aforementioned computer program product is provided. The carrier is one of an electronic signal, an optical signal, a radio signal, or a non-transitory computer-readable storage medium.

Any or all of the features and functions described above can be combined with each other, except to the extent it may be otherwise stated above or to the extent that any such embodiments may be incompatible by virtue of their function or structure, as will be apparent to persons of ordinary skill in the art. Unless contrary to physical possibility, it is envisioned that (i) the methods/steps described herein may be performed in any sequence and/or in any combination, and (ii) the components of respective embodiments may be combined in any manner.

Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims, and other equivalent features and acts are intended to be within the scope of the claims.

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense, i.e., in the sense of “including, but not limited to.” As used herein, the terms “connected,” “coupled,” or any variant thereof means any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words “herein,” “above,” “below,” and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application. Where the context permits, words using the singular or plural number may also include the plural or singular number respectively. The word “or,” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list. Likewise, the term “and/or,” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list.

Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be either X, Y or Z, or any combination thereof. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y and at least one of Z to each be present. Further, use of the phrase “at least one of X, Y or Z” as used in general is to convey that an item, term, etc. may be either X, Y or Z, or any combination thereof.

In some embodiments, certain operations, acts, events, or functions of any of the algorithms described herein can be performed in a different sequence, can be added, merged, or left out altogether (e.g., not all are necessary for the practice of the algorithms). In certain embodiments, operations, acts, functions, or events can be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially.

Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described. Software and other modules may reside and execute on servers, workstations, personal computers, computerized tablets, PDAs, and other computing devices suitable for the purposes described herein. Software and other modules may be accessible via local computer memory, via a network, via a browser, or via other means suitable for the purposes described herein. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, interactive voice response, command line interfaces, and other suitable interfaces.

Further, processing of the various components of the illustrated systems can be distributed across multiple machines, networks, and other computing resources. Two or more components of a system can be combined into fewer components. Various components of the illustrated systems can be implemented in one or more virtual machines or an isolated execution environment, rather than in dedicated computer hardware systems and/or computing devices. Likewise, the data repositories shown can represent physical and/or logical data storage, including, e.g., storage area networks or other distributed storage systems. Moreover, in some embodiments the connections between the components shown represent possible paths of data flow, rather than actual connections between hardware. While some examples of possible connections are shown, any of the subset of the components shown can communicate with any other subset of components in various implementations.

Embodiments are also described above with reference to flow chart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. Each block of the flow chart illustrations and/or block diagrams, and combinations of blocks in the flow chart illustrations and/or block diagrams, may be implemented by computer program instructions. Such instructions may be provided to a processor of a general purpose computer, special purpose computer, specially-equipped computer (e.g., comprising a high-performance database server, a graphics subsystem, etc.) or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor(s) of the computer or other programmable data processing apparatus, create means for implementing the acts specified in the flow chart and/or block diagram block or blocks. These computer program instructions may also be stored in a non-transitory computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the acts specified in the flow chart and/or block diagram block or blocks. The computer program instructions may also be loaded to a computing device or other programmable data processing apparatus to cause operations to be performed on the computing device or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computing device or other programmable apparatus provide steps for implementing the acts specified in the flow chart and/or block diagram block or blocks.

Any patents and applications and other references noted above, including any that may be listed in accompanying filing papers, are incorporated herein by reference. Aspects of the invention can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further implementations of the invention. These and other changes can be made to the invention in light of the above Detailed Description. While the above description describes certain examples of the invention, and describes the best mode contemplated, no matter how detailed the above appears in text, the invention can be practiced in many ways. Details of the system may vary considerably in its specific implementation, while still being encompassed by the invention disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms. Accordingly, the actual scope of the invention encompasses not only the disclosed examples, but also all equivalent ways of practicing or implementing the invention under the claims.

To reduce the number of claims, certain aspects of the invention are presented below in certain claim forms, but the applicant contemplates other aspects of the invention in any number of claim forms. For example, while only one aspect of the invention is recited as a means-plus-function claim under 35 U.S.C. § 112(f) (AIA), other aspects may likewise be embodied as a means-plus-function claim, or in other forms, such as being embodied in a computer-readable medium. Any claims intended to be treated under 35 U.S.C. § 112(f) will begin with the words “means for,” but use of the term “for” in any other context is not intended to invoke treatment under 35 U.S.C. § 112(f). Accordingly, the applicant reserves the right to pursue additional claims after filing this application, in either this application or in a continuing application.

6.0 Example Embodiments

Various example embodiments of methods, systems, and non-transitory computer-readable media relating to features described herein can be found in the following clauses:

-   Clause 1. A method, comprising:
    -   obtaining a stream of raw machine data generated by one or more components in an information technology environment for processing by a data processing pipeline;
    -   for each raw machine data in the stream of raw machine data as the respective raw machine data is obtained,
        -   generating, using a machine learning model that is a component in the data processing pipeline, a prediction regarding a property of the respective raw machine data,
        -   evolving the machine learning model in response to the respective raw machine data satisfying a condition;
        -   generating an output based on at least some of the generated predictions; and
        -   providing the output to another component in the data processing pipeline.
-   Clause 2. The method of Clause 1, wherein generating a prediction further comprises generating an indication of whether the respective raw machine data is an outlier.
-   Clause 3. The method of Clause 1, wherein generating a prediction further comprises:
    -   generating a data subset using the respective raw machine data, wherein the data subset is associated with a timestamp;
    -   placing the data subset in an ordered hierarchy of data subsets using the timestamp to form an updated ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the updated ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 4. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining that no data subsets in an ordered hierarchy of data subsets generated using raw machine data already applied to the machine learning model are to be discarded;
    -   generating a new data subset using the respective raw machine data, wherein the new data subset is associated with a timestamp;
    -   placing the new data subset in the ordered hierarchy of data subsets using the timestamp to form an updated ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the updated ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 5. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining that a first data subset in an ordered hierarchy of data subsets generated using raw machine data already applied to the machine learning model is to be discarded;
    -   discarding the first data subset from the ordered hierarchy of data subsets to form an updated ordered hierarchy of data subsets;
    -   generating a new data subset using the respective raw machine data, wherein the new data subset is associated with a timestamp;
    -   placing the new data subset in the updated ordered hierarchy of data subsets using the timestamp to form a second updated ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the second updated ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 6. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining that a first data subset in an ordered hierarchy of data subsets generated using raw machine data already applied to the machine learning model includes at least one raw machine data associated with a timestamp older than a threshold time;
    -   discarding the first data subset from the ordered hierarchy of data subsets to form an updated ordered hierarchy of data subsets;
    -   generating a new data subset using the respective raw machine data, wherein the new data subset is associated with a timestamp;
    -   placing the new data subset in the updated ordered hierarchy of data subsets using the timestamp to form a second updated ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the second updated ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 7. The method of Clause 1, wherein generating a prediction further comprises:
    -   generating a data subset using the respective raw machine data, wherein the data subset is associated with a timestamp;
    -   placing the data subset in an ordered hierarchy of data subsets using the timestamp to form an updated ordered hierarchy of data subsets;
    -   iterating through the updated ordered hierarchy of data subsets, from a most recent data subset in the updated ordered hierarchy of data subsets to a least recent data subset in the updated ordered hierarchy of data subsets, to determine whether successive data subsets in the updated ordered hierarchy of data subsets are to be merged;
    -   merging successive data subsets in the updated ordered hierarchy of data subsets that are determined to be merged to form a merged ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the merged ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 8. The method of Clause 1, wherein generating a prediction further comprises:
    -   generating a data subset using the respective raw machine data, wherein the data subset is associated with a timestamp;
    -   placing the data subset in an ordered hierarchy of data subsets using the timestamp to form an updated ordered hierarchy of data subsets;
    -   for each data subset in the updated ordered hierarchy of data subsets, determining a first quantile and a second quantile;
    -   aggregating the first quantiles;
    -   aggregating the second quantiles; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the aggregated first quantiles and the aggregated second quantiles.
-   Clause 9. The method of Clause 1, wherein generating a prediction further comprises:
    -   generating a data subset using the respective raw machine data, wherein the data subset is associated with a timestamp;
    -   placing the data subset in an ordered hierarchy of data subsets using the timestamp to form an updated ordered hierarchy of data subsets;
    -   determining a first quantile and a second quantile using the updated ordered hierarchy of data subsets; and
    -   generating the prediction that the respective raw machine data is an outlier value in response to a determination that the raw machine data falls below the first quantile or falls above the second quantile.
-   Clause 10. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining that no sketches in an ordered hierarchy of sketches generated using raw machine data already applied to the machine learning model are to be discarded;
    -   generating a new sketch using the respective raw machine data, wherein the new sketch is associated with a timestamp;
    -   placing the new sketch in the ordered hierarchy of sketches using the timestamp to form an updated ordered hierarchy of sketches;
    -   iterating through the updated ordered hierarchy of sketches, from a most recent sketch in the updated ordered hierarchy of sketches to a least recent sketch in the updated ordered hierarchy of sketches, to determine whether successive sketches in the updated ordered hierarchy of sketches are to be merged;
    -   merging successive sketches in the updated ordered hierarchy of sketches that are determined to be merged to form a merged ordered hierarchy of sketches;
    -   determining a first quantile and a second quantile using the merged ordered hierarchy of sketches; and
    -   generating the prediction that the respective raw machine data is one of an outlier value or a normal value based on the determined first quantile and the second quantile.
-   Clause 11. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining that a sequence of the respective raw machine data and other raw machine data already applied to the machine learning model correspond with a first data pattern; and
    -   in response to determining that the sequence corresponds with the first data pattern, generating the prediction that the sequence is anomalous.
-   Clause 12. The method of Clause 1, wherein generating a prediction further comprises:
    -   comparing a sequence of the respective raw machine data and other raw machine data already applied to the machine learning model with a first set of data patterns;
    -   assigning the sequence to a new data pattern separate from the first set of data patterns based on a distance between the sequence and each data pattern in the first set of data patterns being greater than a minimum cluster distance; and
    -   determining that the sequence is anomalous in response to an assignment of the sequence to the new data pattern.
-   Clause 13. The method of Clause 1, wherein the respective raw machine data comprises text and a rating, and wherein evolving the machine learning model further comprises evolving the machine learning model using the text and the rating.
-   Clause 14. The method of Clause 1, wherein the respective raw machine data comprises text and a rating that corresponds with one of a positive sentiment or a negative sentiment, and wherein evolving the machine learning model further comprises evolving the machine learning model using the text and the rating.
-   Clause 15. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises generating the prediction using the machine learning model and the text, wherein the prediction comprises a rating.
-   Clause 16. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises generating the prediction using the machine learning model and the text, wherein the prediction comprises a rating and one of a positive sentiment or a negative sentiment that is based on the rating.
-   Clause 17. The method of Clause 1, wherein generating a prediction further comprises:
    -   generating one or more tokens using the text;
    -   generating a vector using the one or more tokens; and
    -   applying the vector as an input to the machine learning model to generate the prediction.
-   Clause 18. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises:
    -   generating one or more tokens using the text;
    -   generating a vector using the one or more tokens; and
    -   applying the vector as an input to the machine learning model to generate the prediction, wherein the prediction comprises one of an indication that the respective raw machine data is associated with a positive sentiment or an indication that the respective raw machine data is associated with a negative sentiment.
-   Clause 19. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises:
    -   generating one or more tokens using the text;
    -   generating a vector using the one or more tokens; and
    -   applying the vector as an input to the machine learning model to generate the prediction, wherein the machine learning model is trained using an online stochastic gradient descent algorithm.
-   Clause 20. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises:
    -   generating one or more tokens using the text;
    -   generating a vector using the one or more tokens; and
    -   applying the vector as an input to the machine learning model to generate the prediction, wherein the machine learning model is trained using an adaptive online stochastic gradient descent algorithm.
-   Clause 21. The method of Clause 1, wherein the respective raw machine data comprises text, and wherein generating a prediction further comprises:
    -   generating one or more tokens using the text;
    -   generating a vector using the one or more tokens; and
    -   applying the vector as an input to the machine learning model to generate the prediction, wherein the machine learning model is trained using a norm version of an adaptive online stochastic gradient descent algorithm.
-   Clause 22. The method of Clause 1, wherein generating a prediction further comprises detecting that the respective raw machine data is a transition point at which subsequent raw machine data in the stream of raw machine data have a different distribution than previous raw machine data in the stream of raw machine data.
-   Clause 23. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining a probability that the respective raw machine data comprises a changepoint at which subsequent raw machine data in the stream of raw machine data have a different distribution than previous raw machine data in the stream of raw machine data; and
    -   generating the prediction based on the determined probability.
-   Clause 24. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining a probability that the respective raw machine data comprises a changepoint at which subsequent raw machine data in the stream of raw machine data have a different distribution than previous raw machine data in the stream of raw machine data; and
    -   generating the prediction indicating that the respective raw machine data comprises the changepoint based on the determined probability.
-   Clause 25. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining a probability that the respective raw machine data comprises a changepoint at which subsequent raw machine data in the stream of raw machine data have a different distribution than previous raw machine data in the stream of raw machine data;
    -   determining a probability that the respective raw machine data has a same distribution as previous raw machine data in the stream of raw machine data; and
    -   generating the prediction based on the determined probabilities.
-   Clause 26. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining, using a finite number of previous raw machine data probability distributions, a probability that the respective raw machine data comprises a changepoint at which subsequent raw machine data in the stream of raw machine data have a different distribution than previous raw machine data in the stream of raw machine data;
    -   determining, using the finite number of the previous raw machine data probability distributions, a probability that the respective raw machine data has a same distribution as previous raw machine data in the stream of raw machine data; and
    -   generating the prediction based on the determined probabilities.
-   Clause 27. The method of Clause 1, wherein generating a prediction further comprises:
    -   determining a probability distribution for the respective raw machine data;
    -   discarding a probability distribution for a previous raw machine data in the stream of raw machine data that is associated with a time outside of a time window;
    -   determining an updated probability distribution for each probability distribution in a first set of probability distributions that are each associated with a time inside the time window using at least one of the respective raw machine data or the discarded probability distribution to form a first set of updated probability distributions; and
    -   generating the prediction indicating whether the respective raw machine data comprises a changepoint based on the determined probability distribution for the respective raw machine data and the first set of updated probability distributions.
-   Clause 28. The method of Clause 1, wherein the condition comprises one of the respective raw machine data is associated with a time falling within a time window, the respective raw machine data is greater than a minimum cluster distance from a set of data patterns, the respective raw machine data does not comprise a rating, or the respective raw machine data is one of a threshold number of most recent raw machine data in the stream.
-   Clause 29. 
A system, comprising:    -   one or more data stores including computer-executable        instructions; and    -   one or more processors configured to execute the        computer-executable instructions, wherein execution of the        computer-executable instructions causes the system to:    -   obtain a stream of raw machine data generated by one or more        components in an information technology environment for        processing by a data processing pipeline;    -   for each raw machine data in the stream of raw machine data as        the respective raw machine data is obtained,        -   generate, using a machine learning model that is a component            in the data processing pipeline, a prediction regarding a            property of the respective raw machine data,        -   evolve the machine learning model in response to the            respective raw machine data satisfying a condition;        -   generate an output based on at least some of the generated            predictions; and        -   provide the output to another component in the data            processing pipeline.    -   Clause 30. 
Non-transitory computer-readable media comprising        instructions executable by a computing system to:    -   obtain a stream of raw machine data generated by one or more        components in an information technology environment for        processing by a data processing pipeline;    -   for each raw machine data in the stream of raw machine data as        the respective raw machine data is obtained,        -   generate, using a machine learning model that is a component            in the data processing pipeline, a prediction regarding a            property of the respective raw machine data,        -   evolve the machine learning model in response to the            respective raw machine data satisfying a condition;        -   generate an output based on at least some of the generated            predictions; and        -   provide the output to another component in the data            processing pipeline.    -   Clause 31. A method, comprising:        -   extracting one or more tokens from raw machine data, the raw            machine data generated by one or more components in an            information technology environment;        -   comparing the extracted one or more tokens to a first set of            data patterns;        -   determining that a first value of a first token in the one            or more tokens is anomalous in response to the comparison,            wherein the first value of the first token is determined to            be anomalous prior to the raw machine data being indexed and            stored in a data intake and query system;        -   determining that a second value of a second token in the one            or more tokens corresponds to a range of values; and        -   causing display of information indicating that there is a            correlation between the second token having the second value            and the first token having an anomalous value.    -   Clause 32. 
The method of Clause 31, further comprising:        -   extracting the first token and the second token from second raw machine data, the second raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the second raw machine data to the first set of data patterns;        -   determining that a third value of the first token from the second raw machine data is anomalous in response to the comparison; and        -   storing a fourth value of the second token from the second raw machine data, wherein the fourth value is a minimum value in the range of values.    -   Clause 33. The method of Clause 31, further comprising:        -   extracting the first token and the second token from second raw machine data, the second raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the second raw machine data to the first set of data patterns;        -   determining that a third value of the first token from the second raw machine data is anomalous in response to the comparison;        -   storing a fourth value of the second token from the second raw machine data, wherein the fourth value is a minimum value in the range of values;        -   extracting the first token and the second token from third raw machine data, the third raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the third raw machine data 
to the first set of data patterns;        -   determining that a fifth value of the first token from the            third raw machine data is anomalous in response to the            comparison; and        -   storing a sixth value of the second token from the third raw            machine data, wherein the sixth value is a maximum value in            the range of values.    -   Clause 34. The method of Clause 31, further comprising:        -   extracting the first token and the second token from second            raw machine data, the second raw machine data generated by            the one or more components in the information technology            environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the            second raw machine data to the first set of data patterns;        -   determining that a third value of the first token from the            second raw machine data is anomalous in response to the            comparison;        -   storing a fourth value of the second token from the second            raw machine data, wherein the fourth value is a minimum            value in the range of values;        -   extracting the first token and the second token from third            raw machine data, the third raw machine data generated by            the one or more components in the information technology            environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the            third raw machine data to the first set of data patterns;        -   determining that a fifth value of the first token from the            third raw machine data is anomalous in response to the            comparison;        -   storing a sixth value of the second token from the third raw            machine data, wherein the sixth value is a maximum value in            the range of values;        -   extracting the first token and the second token from 
fourth raw machine data, the fourth raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   comparing the first token and the second token from the fourth raw machine data to the first set of data patterns;        -   determining that a seventh value of the first token from the fourth raw machine data is not anomalous in response to the comparison;        -   determining that an eighth value of the second token from the fourth raw machine data does not fall within the range of values; and        -   determining that the range of values correlates to values of the first token being anomalous.    -   Clause 35. The method of Clause 31, wherein determining that a second value of a second token in the one or more tokens corresponds to a range of values further comprises determining that the second value of the second token matches a specific value.    -   Clause 36. The method of Clause 31, further comprising:        -   determining that a third value of a third token in the one or more tokens corresponds to a second range of values; and        -   causing display of information indicating that there is a correlation between the second token having the second value, the third token having the third value, and the first token having an anomalous value.    -   Clause 37. The method of Clause 31, wherein the information indicates that the first value of the first token is anomalous.    -   Clause 38. The method of Clause 31, wherein the information comprises at least one of a notification, a table, a graph, a chart, or an annotated version of the raw machine data.    -   Clause 39. 
The method of Clause 31, wherein the first token comprises user device usage, and wherein the second token comprises a user device model.    -   Clause 40. The method of Clause 31, wherein extracting one or more tokens from raw machine data further comprises extracting the one or more tokens from the raw machine data within a threshold time of the raw machine data being ingested into the data intake and query system.    -   Clause 41. The method of Clause 31, wherein a stream of raw machine data is ingested into the data intake and query system in sequence, wherein the stream of raw machine data comprises the raw machine data and other raw machine data that follows the raw machine data in time, and wherein determining that a first value of a first token in the one or more tokens is anomalous further comprises determining that the first value of the first token in the one or more tokens is anomalous prior to any of the other raw machine data being stored in the data intake and query system.    -   Clause 42. The method of Clause 31, wherein a stream of raw machine data is ingested into the data intake and query system in sequence, wherein the stream of raw machine data comprises the raw machine data and other raw machine data that follows the raw machine data in time, and wherein the method further comprises determining in sequence, for each of the other raw machine data, whether the respective other raw machine data is anomalous as the respective other raw machine data is ingested into the data intake and query system and subsequent to determining that the first value of the first token in the one or more tokens is anomalous.    -   Clause 43. The method of Clause 31, wherein extracting one or more tokens further comprises generating a string vector using the one or more tokens.    
-   Clause 44. The method of Clause 31, wherein extracting one or        more tokens further comprises generating a string vector using        the one or more tokens, and wherein each element of the string        vector corresponds to one of the one or more tokens.    -   Clause 45. The method of Clause 31, wherein determining that a        first value of a first token in the one or more tokens is        anomalous further comprises:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance; and        -   determining that the first value of the first token is            anomalous in response to an assignment of the one or more            tokens to the new data pattern.    -   Clause 46. The method of Clause 31, wherein determining that a        first value of a first token in the one or more tokens is        anomalous further comprises:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance;        -   updating the minimum cluster distance based on a creation of            the new data pattern; and        -   determining that the first value of the first token is            anomalous in response to an assignment of the one or more            tokens to the new data pattern.    -   Clause 47. 
The method of Clause 31, wherein determining that a        first value of a first token in the one or more tokens is        anomalous further comprises:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance, wherein the one or more tokens is assigned            to the new data pattern prior to the raw machine data being            indexed and stored in the data intake and query system;        -   updating the minimum cluster distance based on a creation of            the new data pattern; and        -   determining that the first value of the first token is            anomalous in response to an assignment of the one or more            tokens to the new data pattern.    -   Clause 48. The method of Clause 31, wherein determining that a        first value of a first token in the one or more tokens is        anomalous further comprises:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance, wherein the one or more tokens is assigned            to the new data pattern prior to the raw machine data being            indexed and stored in the data intake and query system;        -   updating the minimum cluster distance based on a creation of            the new data pattern;        -   extracting one or more second tokens from second raw machine            data, the second raw machine data generated by the one or            more components in the information technology environment;        -   comparing the one or more second tokens to the first set of            data patterns and the new data pattern; and  
      -   assigning the one or more second tokens to a first data            pattern in the first set of data patterns based on a            distance between the one or more second tokens and the first            data pattern being less than the updated minimum cluster            distance.    -   Clause 49. The method of Clause 31, further comprising:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance, wherein the one or more tokens is assigned            to the new data pattern prior to the raw machine data being            indexed and stored in the data intake and query system;        -   updating the minimum cluster distance based on a creation of            the new data pattern;        -   extracting one or more second tokens from second raw machine            data, the second raw machine data generated by the one or            more components in the information technology environment;        -   comparing the one or more second tokens to the first set of            data patterns and the new data pattern;        -   assigning the one or more second tokens to a first data            pattern in the first set of data patterns based on a            distance between the one or more second tokens and the first            data pattern being less than the updated minimum cluster            distance;        -   determining that the first data pattern does not completely            describe the one or more second tokens; and        -   updating the first data pattern to include a wildcard such            that the updated first data pattern completely describes the            one or more second tokens.    -   Clause 50. 
The method of Clause 31, further comprising:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance, wherein the one or more tokens is assigned            to the new data pattern prior to the raw machine data being            indexed and stored in the data intake and query system;        -   updating the minimum cluster distance based on a creation of            the new data pattern;        -   extracting one or more second tokens from second raw machine            data, the second raw machine data generated by the one or            more components in the information technology environment;        -   comparing the one or more second tokens to the first set of            data patterns and the new data pattern;        -   assigning the one or more second tokens to a first data            pattern in the first set of data patterns based on a            distance between the one or more second tokens and the first            data pattern being less than the updated minimum cluster            distance, wherein the first data pattern comprises a            wildcard at a first position;        -   determining a distribution of token values at the first            position in tokens assigned to the first data pattern;        -   determining that a token value at the first position in the            one or more second tokens falls below a percentile in the            distribution; and        -   determining that the second raw machine data corresponding            to the one or more second tokens is anomalous in response to            the token value at the first position in the one or more            second tokens falling below the percentile.    -   Clause 51. 
The method of Clause 31, further comprising:        -   assigning the one or more tokens to a new data pattern            separate from the first set of data patterns based on a            distance between the one or more tokens and each data            pattern in the first set being greater than a minimum            cluster distance, wherein the one or more tokens is assigned            to the new data pattern prior to the raw machine data being            indexed and stored in the data intake and query system;        -   updating the minimum cluster distance based on a creation of            the new data pattern;        -   extracting the second token from second raw machine data,            the second raw machine data generated by the one or more            components in the information technology environment;        -   comparing the second token from the second raw machine data            to the first set of data patterns and the new data pattern;        -   assigning the second token from the second raw machine data            to a first data pattern in the first set of data patterns            based on a distance between the second token from the second            raw machine data and the first data pattern being less than            the updated minimum cluster distance, wherein the first data            pattern comprises a wildcard at a first position;        -   determining a distribution of token values at the first            position in tokens assigned to the first data pattern;        -   determining that a token value at the first position in the            second token from the second raw machine data falls below a            percentile in the distribution;        -   determining that the second raw machine data corresponding            to the second token from the second raw machine data is            anomalous in response to the token value at the first            position in the second token from the second raw machine            data falling 
below the percentile;        -   determining that a third value of the second token from the second raw machine data corresponds to the range of values; and        -   causing display of second information indicating that there is a correlation between the second token having the third value and the second raw machine data being anomalous.    -   Clause 52. The method of Clause 31, wherein extracting one or more tokens further comprises:        -   identifying one or more delimiters in the raw machine data;        -   identifying the one or more tokens based on the identified one or more delimiters; and        -   forming the one or more tokens using the one or more tokens.    -   Clause 53. The method of Clause 31, further comprising:        -   extracting one or more second tokens from second raw machine data;        -   comparing the extracted one or more second tokens to the first set of data patterns;        -   determining that a third value of a third token in the one or more second tokens is anomalous in response to the comparison;        -   determining that no token in the one or more second tokens is correlated with the third token having the third value;        -   extracting a fourth token from the second raw machine data;        -   determining that there is a correlation between the fourth token and the third token; and        -   causing display of information indicating that there is a correlation between the fourth token having the fourth value and the third token having an anomalous value.    -   Clause 54. 
A system, comprising:        -   one or more data stores including computer-executable            instructions; and        -   one or more processors configured to execute the            computer-executable instructions, wherein execution of the            computer-executable instructions causes the system to:            -   extract one or more tokens from raw machine data, the                raw machine data generated by one or more components in                an information technology environment;            -   compare the extracted one or more tokens to a first set                of data patterns;            -   determine that a first value of a first token in the one                or more tokens is anomalous in response to the                comparison, wherein the first value of the first token                is determined to be anomalous prior to the raw machine                data being indexed and stored in a data intake and query                system;            -   determine that a second value of a second token in the                one or more tokens corresponds to a range of values; and            -   cause display of information indicating that there is a                correlation between the second token having the second                value and the first token having an anomalous value.    -   Clause 55. 
The system of Clause 54, wherein execution of the computer-executable instructions further causes the system to:        -   extract the first token and the second token from second raw machine data, the second raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   compare the first token and the second token from the second raw machine data to the first set of data patterns;        -   determine that a third value of the first token from the second raw machine data is anomalous in response to the comparison; and        -   store a fourth value of the second token from the second raw machine data, wherein the fourth value is a minimum value in the range of values.    -   Clause 56. The system of Clause 54, wherein the information comprises at least one of a notification, a table, a graph, a chart, or an annotated version of the raw machine data.    -   Clause 57. 
The system of Clause 54, wherein execution of the computer-executable instructions further causes the system to:        -   extract one or more second tokens from second raw machine data;        -   compare the extracted one or more second tokens to the first set of data patterns;        -   determine that a third value of a third token in the one or more second tokens is anomalous in response to the comparison;        -   determine that no token in the one or more second tokens is correlated with the third token having the third value;        -   extract a fourth token from the second raw machine data;        -   determine that there is a correlation between the fourth token and the third token; and        -   cause display of information indicating that there is a correlation between the fourth token having the fourth value and the third token having an anomalous value.    -   Clause 58. 
Non-transitory computer-readable media comprising instructions executable by a computing system to:        -   extract one or more tokens from raw machine data, the raw machine data generated by one or more components in an information technology environment;        -   compare the extracted one or more tokens to a first set of data patterns;        -   determine that a first value of a first token in the one or more tokens is anomalous in response to the comparison, wherein the first value of the first token is determined to be anomalous prior to the raw machine data being indexed and stored in a data intake and query system;        -   determine that a second value of a second token in the one or more tokens corresponds to a range of values; and        -   cause display of information indicating that there is a correlation between the second token having the second value and the first token having an anomalous value.    -   Clause 59. The non-transitory computer-readable media of Clause 58, further comprising instructions executable by a computing system to:        -   extract the first token and the second token from second raw machine data, the second raw machine data generated by the one or more components in the information technology environment prior to generation of the raw machine data;        -   compare the first token and the second token from the second raw machine data to the first set of data patterns;        -   determine that a third value of the first token from the second raw machine data is anomalous in response to the comparison; and        -   store a fourth value of the second token from the second raw machine data, wherein the fourth value is a minimum value in the range of values.    -   Clause 60. 
The non-transitory computer-readable media of Clause 58, further comprising instructions executable by a computing system to:        -   extract one or more second tokens from second raw machine data;        -   compare the extracted one or more second tokens to the first set of data patterns;        -   determine that a third value of a third token in the one or more second tokens is anomalous in response to the comparison;        -   determine that no token in the one or more second tokens is correlated with the third token having the third value;        -   extract a fourth token from the second raw machine data;        -   determine that there is a correlation between the fourth token and the third token; and        -   cause display of information indicating that there is a correlation between the fourth token having the fourth value and the third token having an anomalous value.    -   Clause 61. 
A method, comprising:        -   providing a user interface depicting a graph representing a            data processing pipeline, wherein the graph comprises a            first data processing node interconnected with a machine            learning model;        -   receiving, via the user interface, a request to activate a            preview mode in association with the machine learning model;        -   obtaining first data generated by the first data processing            node;        -   applying the first data as an input to the machine learning            model to generate output data;        -   determining that the output data comprises a first number of            a first label type and a second number of a second label            type;        -   selecting a first subset of the first number of the first            label type and a second subset of the second number of the            second label type; and        -   causing the user interface to display a preview of the            output data output by the machine learning model that            comprises the first subset of the first number of the first            label type and the second subset of the second number of the            second label type.    -   Clause 62. The method of Clause 61, wherein causing the user        interface to display a preview further comprises causing the        user interface to display the preview without writing the output        data to at least one destination specified by the graph.    -   Clause 63. The method of Clause 61, further comprising        retrieving input data from at least one source specified by the        graph in response to the request to activate the preview mode.    -   Clause 64. The method of Clause 61, wherein the first data        comprises live data streamed from a source specified by the        graph.    -   Clause 65. 
The method of Clause 61, further comprising:        -   retrieving input data from at least one source specified by            the graph in response to the request to activate the preview            mode; and        -   causing the input data to be transformed according to the            first data processing node to generate the first data.    -   Clause 66. The method of Clause 61, further comprising        transmitting an abstract syntax tree (AST) of the data        processing pipeline to an intake system, wherein the intake        system produces an augmented AST by causing a function of the        graph that writes to an external database to drop received data        instead of writing the received data to the external database        and by adding a preview node to the graph in association with        the machine learning model.    -   Clause 67. The method of Clause 61, further comprising        transmitting an abstract syntax tree (AST) of the data        processing pipeline to an intake system, wherein the intake        system produces an augmented AST by causing a function of the        graph that writes to an external database to drop received data        instead of writing the received data to the external database        and by adding a preview node to the graph in association with        the machine learning model, and wherein the intake system runs a        job using the augmented AST that results in the first data being        transmitted to the preview node.    -   Clause 68. 
The method of Clause 61, further comprising        transmitting an abstract syntax tree (AST) of the data        processing pipeline to an intake system, wherein the intake        system produces an augmented AST by causing a function of the        graph that writes to an external database to drop received data        instead of writing the received data to the external database        and by adding a preview node to the graph in association with        the machine learning model, wherein the intake system runs a job        using the augmented AST that results in the first data being        transmitted to the preview node, and wherein applying the first        data as an input to the machine learning model to generate        output data further comprises applying, by the preview node, the        first data as an input to the machine learning model to generate        output data.    -   Clause 69. The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, and wherein applying the first data        as an input to the machine learning model further comprises        applying, in sequence, each of the data items of the stream of        data items as an input to the machine learning model to generate        the output data.    -   Clause 70. 
The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model further comprises, for        each data item of the stream of data items, applying the        respective data item as an input to the machine learning model        to generate a portion of the output data, and wherein        determining that the output data comprises a first number of a        first label type and a second number of a second label type        further comprises, for each data item of the stream of data        items, determining that the portion of the output data generated        using the respective data item corresponds to one of the first        label type or the second label type after the portion of the        output data is generated and before a subsequent portion of the        output data is generated.    -   Clause 71. The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model further comprises, for        each data item of the stream of data items in sequence, applying        the respective data item as an input to the machine learning        model to generate a portion of the output data, and wherein        determining that the output data comprises a first number of a        first label type and a second number of a second label type        further comprises:        -   for each data item of the stream of data items in sequence,            determining that the portion of the output data generated            using the respective data item corresponds to one of the            first label type or the second label type after the portion            of the output data is generated and before a subsequent            portion of the output data is 
generated; and        -   incrementing a count of one of the first label type or the            second label type.    -   Clause 72. The method of Clause 61, wherein applying the first        data as an input to the machine learning model to generate        output data further comprises applying the first data as the        input to the machine learning model for a first period of time.    -   Clause 73. The method of Clause 61, wherein applying the first        data as an input to the machine learning model to generate        output data further comprises applying the first data as the        input to the machine learning model for a first period of time,        and wherein the first data corresponds to a second period of        time.    -   Clause 74. The method of Clause 61, wherein applying the first        data as an input to the machine learning model to generate        output data further comprises applying the first data as the        input to the machine learning model for a first period of time,        and wherein the first data corresponds to a second period of        time greater than the first period of time.    -   Clause 75. The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model to generate output data        further comprises:        -   for each data item of the stream of data items in sequence,            applying the respective data item as an input to the machine            learning model to generate a portion of the output data; and        -   determining, a first period of time after an initial portion            of the output data is generated, that no portion of the            output data corresponds to a third type of label.    -   Clause 76. 
The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model to generate output data        further comprises:        -   for each data item of the stream of data items in sequence,            applying the respective data item as an input to the machine            learning model to generate a portion of the output data;        -   determining, a first period of time after an initial portion            of the output data is generated, that no portion of the            output data corresponds to a third type of label; and        -   stopping application of the stream of data items as an input            to the machine learning model.    -   Clause 77. The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model to generate output data        further comprises:        -   for each data item of the stream of data items in sequence,            applying the respective data item as an input to the machine            learning model to generate a portion of the output data; and        -   stopping application of the stream of data items as an input            to the machine learning model after a timeout period            expires.    -   Clause 78. 
The method of Clause 61, wherein the first data        comprises a stream of data items generated by the first data        processing node in sequence, wherein applying the first data as        an input to the machine learning model to generate output data        further comprises:        -   for each data item of the stream of data items in sequence,            applying the respective data item as an input to the machine            learning model to generate a portion of the output data; and        -   stopping application of the stream of data items as an input            to the machine learning model after a timeout period            expires, wherein the timeout period begins at a time that an            initial portion of the output data is generated.    -   Clause 79. The method of Clause 61, wherein the first number is        greater than the second number.    -   Clause 80. The method of Clause 61, wherein the first number is        greater than the second number, and wherein a number of the        first subset of the first number of the first label type equals        a number of the second subset of the second number of the second        label type.    -   Clause 81. The method of Clause 61, wherein selecting a first        subset of the first number of the first label type and a second        subset of the second number of the second label type further        comprises selecting an equal number of the first label type and        the second label type to form the first subset and the second        subset.    -   Clause 82. The method of Clause 61, wherein selecting a first        subset of the first number of the first label type and a second        subset of the second number of the second label type further        comprises downsampling the first number of the first label type        and upsampling the second number of the second label type.    -   Clause 83. 
The method of Clause 61, wherein the output data is        provided as an input to a second data processing node of the        graph.    -   Clause 84. The method of Clause 61, wherein a first tab in a        user interface depicts an interactive element that allows a user        to request activation of the preview mode.    -   Clause 85. The method of Clause 61, wherein a first tab in a        user interface depicts an interactive element that allows a user        to request activation of the preview mode, and wherein the        preview is displayed in a second tab in the user interface.    -   Clause 86. The method of Clause 61, wherein a first window in a        user interface depicts an interactive element that allows a user        to request activation of the preview mode, and wherein the        preview is displayed in a second window in the user interface.    -   Clause 87. The method of Clause 61, wherein the first label type        comprises a first type of event.    -   Clause 88. 
A system, comprising:        -   one or more data stores including computer-executable            instructions; and        -   one or more processors configured to execute the            computer-executable instructions, wherein execution of the            computer-executable instructions causes the system to:            -   provide a user interface depicting a graph representing                a data processing pipeline, wherein the graph comprises                a first data processing node interconnected with a                machine learning model;            -   receive, via the user interface, a request to activate a                preview mode in association with the machine learning                model;            -   obtain first data generated by the first data processing                node;            -   apply the first data as an input to the machine learning                model to generate output data;            -   determine that the output data comprises a first number                of a first label type and a second number of a second                label type;            -   select a first subset of the first number of the first                label type and a second subset of the second number of                the second label type; and            -   cause the user interface to display a preview of the                output data output by the machine learning model that                comprises the first subset of the first number of the                first label type and the second subset of the second                number of the second label type.    -   Clause 89. The system of Clause 88, wherein execution of the        computer-executable instructions further causes the system to        cause the user interface to display the preview without writing        the output data to at least one destination specified by the        graph.    -   Clause 90. 
Non-transitory computer-readable media comprising        instructions executable by a computing system to:        -   provide a user interface depicting a graph representing a            data processing pipeline, wherein the graph comprises a            first data processing node interconnected with a machine            learning model;        -   receive, via the user interface, a request to activate a            preview mode in association with the machine learning model;        -   obtain first data generated by the first data processing            node;        -   apply the first data as an input to the machine learning            model to generate output data;        -   determine that the output data comprises a first number of a            first label type and a second number of a second label type;        -   select a first subset of the first number of the first label            type and a second subset of the second number of the second            label type; and        -   cause the user interface to display a preview of the output            data output by the machine learning model that comprises the            first subset of the first number of the first label type and            the second subset of the second number of the second label            type.    -   Clause 91. 
A method, comprising:        -   obtaining first raw machine data from an event data stream            generated by one or more components in an information            technology environment;        -   updating a model using the first raw machine data and a            first machine learning algorithm to generate an evolved            model;        -   obtaining second raw machine data from the event data stream            generated by the one or more components in the information            technology environment;        -   generating a first updated model using the second raw            machine data, the first machine learning algorithm, and the            evolved model;        -   generating a second updated model using the second raw            machine data, a second machine learning algorithm, and the            evolved model;        -   comparing an accuracy of the first updated model and an            accuracy of the second updated model on a particular set of            data;        -   determining that the second updated model is more accurate            than the first updated model;        -   obtaining third raw machine data from the event data stream            generated by the one or more components in the information            technology environment; and        -   processing the third raw machine data from the event data            stream using the second updated model.    -   Clause 92. The method of Clause 91, wherein the first machine        learning algorithm comprises a transformation operation and a        reference to a storage location of a model state of the first        updated model.    -   Clause 93. 
The method of Clause 91, wherein the first machine        learning algorithm comprises a transformation operation and a        reference to a storage location of a model state of the first        updated model, and wherein the second machine learning algorithm        comprises a second transformation operation and a reference to a        storage location of a model state of the second updated model.    -   Clause 94. The method of Clause 91, wherein the first machine        learning algorithm comprises a transformation operation and a        reference to a storage location of a model state of the first        updated model, wherein the second machine learning algorithm        comprises a second transformation operation and a reference to a        storage location of a model state of the second updated model,        and wherein the method further comprises swapping the        transformation operation with the second transformation        operation in response to the determination that the second        updated model is more accurate than the first updated model.    -   Clause 95. The method of Clause 91, wherein the first updated        model and the second updated model obtain the particular set of        data from a source specified by a graph representing a data        processing pipeline.    -   Clause 96. The method of Clause 91, wherein the first updated        model and the second updated model obtain the particular set of        data from a source specified by a graph representing a data        processing pipeline, and wherein a version of an output of the        first updated model is written to an external storage system        specified by the graph.    -   Clause 97. 
The method of Clause 91, wherein the first updated        model and the second updated model obtain the particular set of        data from a source specified by a graph representing a data        processing pipeline, wherein a version of an output of the first        updated model is written to an external storage system specified        by the graph, and wherein an output of the second updated model        is not written to any external storage system until the second        updated model is determined to be more accurate than the first        updated model.    -   Clause 98. The method of Clause 91, wherein the first updated        model and the second updated model obtain the particular set of        data from a source specified by a graph representing a data        processing pipeline, wherein a version of an output of the first        updated model is written to an external storage system specified        by the graph, wherein an output of the second updated model is        not written to any external storage system until the second        updated model is determined to be more accurate than the first        updated model, wherein comparing an accuracy of the first        updated model and an accuracy of the second updated model on a        particular set of data further comprises:        -   determining, a time period after the second updated model is            generated, whether to continue writing the version of the            output of the first updated model to the external storage            system or whether to begin writing a version of the output            of the second updated model to the external storage system;            and        -   comparing the accuracy of the first updated model and the            accuracy of the second updated model on a particular set of            data to determine which version of output to write to the            external storage system.    -   Clause 99. 
The method of Clause 91, further comprising        generating a first prediction associated with the first raw        machine data in response to an application of the first raw        machine data as an input to the model.    -   Clause 100. The method of Clause 91, wherein comparing an        accuracy of the first updated model and an accuracy of the        second updated model further comprises:        -   obtaining a set of further raw machine data from the event            data stream; generating one or more first predictions            associated with the set of further raw machine data in            response to an application of the set of further raw machine            data as an input to the first updated model;        -   generating one or more second predictions associated with            the set of further raw machine data in response to an            application of the set of further raw machine data as an            input to the second updated model; and        -   comparing an accuracy of the one or more first predictions            to an accuracy of the one or more second predictions.    -   Clause 101. 
The method of Clause 91, wherein comparing an accuracy of the first updated model and an accuracy of the second updated model further comprises:
        -   obtaining a set of further raw machine data from the event data stream that represents raw machine data obtained from the event data stream over a threshold period of time;
        -   generating one or more first predictions associated with the set of further raw machine data in response to an application of the set of further raw machine data as an input to the first updated model;
        -   generating one or more second predictions associated with the set of further raw machine data in response to an application of the set of further raw machine data as an input to the second updated model; and
        -   comparing an accuracy of the one or more first predictions to an accuracy of the one or more second predictions.
    -   Clause 102. The method of Clause 91, wherein comparing an accuracy of the first updated model and an accuracy of the second updated model further comprises comparing a loss associated with the first updated model and a loss associated with the second updated model.
    -   Clause 103. The method of Clause 91, wherein generating a first updated model further comprises updating, in a production stack, the evolved model using the second raw machine data and the first machine learning algorithm.
    -   Clause 104. The method of Clause 91, wherein generating a second updated model further comprises updating, in a test stack separate from a production stack, the evolved model using the second raw machine data and the second machine learning algorithm.
    -   Clause 105. 
The method of Clause 91, wherein generating a second updated model further comprises updating, in a test stack separate from a production stack, the evolved model using the second raw machine data and the second machine learning algorithm, and wherein the method further comprises re-training, in the production stack, the second updated model using the third raw machine data and the second machine learning algorithm.
    -   Clause 106. The method of Clause 91, further comprising:
        -   obtaining a set of further raw machine data from the event data stream;
        -   generating, in a production stack, one or more first predictions associated with the set of further raw machine data in response to an application of the set of further raw machine data as an input to the first updated model;
        -   generating, in a test stack separate from the production stack, one or more second predictions associated with the set of further raw machine data in response to an application of the set of further raw machine data as an input to the second updated model; and
        -   generating, in the production stack, a third prediction using the third raw machine data and the second updated model.
    -   Clause 107. The method of Clause 91, further comprising:
        -   generating a third updated model using the second raw machine data, a third machine learning algorithm, and the evolved model;
        -   comparing an accuracy of the first updated model, an accuracy of the second updated model, and an accuracy of the third updated model; and
        -   determining that the second updated model is more accurate than the first updated model and the third updated model.
    -   Clause 108. 
The method of Clause 91, further comprising:
        -   generating, in a background environment separate from an environment in which the first updated model is generated, a third updated model using the second raw machine data, a third machine learning algorithm, and the evolved model;
        -   comparing an accuracy of the first updated model, an accuracy of the second updated model, and an accuracy of the third updated model; and
        -   determining that the second updated model is more accurate than the first updated model and the third updated model.
    -   Clause 109. The method of Clause 91, wherein processing the third raw machine data from the event data stream using the second updated model further comprises:
        -   swapping the first updated model with the second updated model in a production stack; and
        -   processing the third raw machine data and subsequent raw machine data using the second updated model in the production stack.
    -   Clause 110. The method of Clause 91, wherein a data ingestion pipeline comprises an operator that implements the first machine learning algorithm, and wherein the method further comprises refreshing the data ingestion pipeline to replace the operator with a second operator that implements the second machine learning algorithm.
    -   Clause 111. The method of Clause 91, wherein a data ingestion pipeline comprises an operator that implements the first machine learning algorithm, and wherein the method further comprises:
        -   refreshing the data ingestion pipeline to replace the operator with a second operator that implements the second machine learning algorithm; and
        -   processing the third raw machine data and subsequent raw machine data in the data ingestion pipeline using the second operator.
-   Clause 112. The method of Clause 91, wherein the first updated        model and the second updated model are generated prior to the        second raw machine data being stored in a data intake and query        system.    -   Clause 113. The method of Clause 91, wherein the first updated        model and the second updated model are generated prior to the        second raw machine data being stored in a data intake and query        system and prior to the third raw machine data being ingested        into the data intake and query system.    -   Clause 114. The method of Clause 91, wherein the first updated        model and the second updated model are generated in parallel.    -   Clause 115. The method of Clause 91, further comprising        generating one or more predictions using the first updated model        and the second updated model in parallel.    -   Clause 116. The method of Clause 91, wherein the evolved model        comprises one or more machine learning model parameters.    -   Clause 117. The method of Clause 91, wherein the evolved model        comprises one or more machine learning model parameters, and        wherein generating a second updated model using the second raw        machine data and a second machine learning algorithm further        comprises updating at least one of the one or more machine        learning model parameters using the second raw machine data and        the second machine learning algorithm.    -   Clause 118. The method of Clause 91, wherein the evolved model        comprises one or more hyperparameters.    -   Clause 119. 
A system, comprising:        -   one or more data stores including computer-executable            instructions; and        -   one or more processors configured to execute the            computer-executable instructions, wherein execution of the            computer-executable instructions causes the system to:            -   obtain first raw machine data from an event data stream                generated by one or more components in an information                technology environment;            -   update a model using the first raw machine data and a                first machine learning algorithm to generate an evolved                model;            -   obtain second raw machine data from the event data                stream generated by the one or more components in the                information technology environment;            -   generate a first updated model using the second raw                machine data, the first machine learning algorithm, and                the evolved model;            -   generate a second updated model using the second raw                machine data, a second machine learning algorithm, and                the evolved model;            -   compare an accuracy of the first updated model and an                accuracy of the second updated model on a particular set                of data;            -   determine that the second updated model is more accurate                than the first updated model;            -   obtain third raw machine data from the event data stream                generated by the one or more components in the                information technology environment; and            -   process the third raw machine data from the event data                stream using the second updated model.    -   Clause 120. 
Non-transitory computer-readable media comprising        instructions executable by a computing system to:    -   obtain first raw machine data from an event data stream        generated by one or more components in an information technology        environment;    -   update a model using the first raw machine data and a first        machine learning algorithm to generate an evolved model;    -   obtain second raw machine data from the event data stream        generated by the one or more components in the information        technology environment;    -   generate a first updated model using the second raw machine        data, the first machine learning algorithm, and the evolved        model;    -   generate a second updated model using the second raw machine        data, a second machine learning algorithm, and the evolved        model;    -   compare an accuracy of the first updated model and an accuracy        of the second updated model on a particular set of data;    -   determine that the second updated model is more accurate than        the first updated model;    -   obtain third raw machine data from the event data stream        generated by the one or more components in the information        technology environment; and    -   process the third raw machine data from the event data stream        using the second updated model.
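
By way of non-limiting illustration, the model comparison recited in Clauses 91, 119, and 120 can be sketched as follows. The sketch assumes a toy model consisting of a single threshold parameter and two hypothetical update rules (`step_update` and `midpoint_update`) standing in for the first and second machine learning algorithms; none of these names, nor the `select_model` helper, appear in the embodiments described herein.

```python
from copy import deepcopy
from typing import Callable, Dict, Sequence, Tuple

Model = Dict[str, float]      # model state, here a single "threshold" parameter
Example = Tuple[float, int]   # (feature value, label of 0 or 1)
Algorithm = Callable[[Model, Sequence[Example]], Model]


def predict(model: Model, x: float) -> int:
    """Label a value 1 when it meets or exceeds the model threshold."""
    return 1 if x >= model["threshold"] else 0


def accuracy(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples in `data` that the model labels correctly."""
    return sum(1 for x, y in data if predict(model, x) == y) / len(data)


def step_update(model: Model, batch: Sequence[Example], rate: float = 0.5) -> Model:
    """Hypothetical first algorithm: nudge the threshold after each mistake."""
    new = deepcopy(model)  # leave the evolved model untouched
    for x, y in batch:
        if predict(new, x) != y:
            # Raise the threshold on a false positive, lower it on a false negative.
            new["threshold"] += rate if y == 0 else -rate
    return new


def midpoint_update(model: Model, batch: Sequence[Example]) -> Model:
    """Hypothetical second algorithm: threshold at the midpoint of the class means."""
    new = deepcopy(model)
    pos = [x for x, y in batch if y == 1]
    neg = [x for x, y in batch if y == 0]
    if pos and neg:
        new["threshold"] = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return new


def select_model(evolved: Model, batch: Sequence[Example],
                 algorithms: Dict[str, Algorithm],
                 holdout: Sequence[Example]) -> Tuple[str, Model]:
    """Update the evolved model with each algorithm and keep the most accurate result."""
    candidates = {name: algo(evolved, batch) for name, algo in algorithms.items()}
    best = max(candidates, key=lambda name: accuracy(candidates[name], holdout))
    return best, candidates[best]
```

In this sketch, both candidates start from the same evolved model and the same second batch of data, and the selected candidate would then be used to process subsequent data. The holdout set plays the role of the particular set of data on which the accuracies are compared.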

Any of the above methods may be embodied within computer-executable instructions which may be stored within a data store or non-transitory computer-readable media and executed by a computing system (e.g., a processor of such system) to implement the respective methods.
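
As one hedged example of such computer-executable instructions, the label-balanced preview sampling of Clauses 61 and 80 through 82 might be sketched as below. The record layout (a dictionary with a `"label"` key), the function name `sample_preview`, and the `per_label` and `seed` parameters are assumptions made for illustration only. Over-represented label types are downsampled, and under-represented label types are upsampled by sampling with replacement.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_preview(outputs: List[Dict], per_label: int, seed: int = 0) -> List[Dict]:
    """Return `per_label` records for every label type found in `outputs`."""
    rng = random.Random(seed)  # seeded so that a preview is reproducible
    by_label = defaultdict(list)
    for record in outputs:
        by_label[record["label"]].append(record)

    preview: List[Dict] = []
    for records in by_label.values():
        if len(records) >= per_label:
            # Downsample: pick distinct records without replacement.
            preview.extend(rng.sample(records, per_label))
        else:
            # Upsample: repeat records by sampling with replacement.
            preview.extend(rng.choices(records, k=per_label))
    return preview
```

With 95 outputs of one label type and 5 of another, a call such as `sample_preview(outputs, per_label=10)` would yield ten records of each type, so that a rare label type is not crowded out of the displayed preview.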

What is claimed is:
 1. A method, comprising: providing a user interfacedepicting a graph representing a data processing pipeline; receiving,via the user interface, a request to activate a preview mode inassociation with a machine learning model; obtaining first datagenerated by a component of the data processing pipeline interconnectedwith the machine learning model; applying the first data as an input tothe machine learning model to generate output data; and causing the userinterface to display a preview of a portion of the output data thatcomprises a sampling of one or more different types of data output bythe machine learning model.
 2. The method of claim 1, wherein causing the user interface to display a preview further comprises causing the user interface to display the preview without writing the output data to at least one destination specified by the graph.
 3. The method of claim 1, further comprising retrieving input data from at least one source specified by the graph in response to the request to activate the preview mode.
 4. The method of claim 1, wherein the first data comprises live data streamed from a source specified by the graph.
 5. The method of claim 1, further comprising: retrieving input data from at least one source specified by the graph in response to the request to activate the preview mode; and causing the input data to be transformed according to the component of the data processing pipeline to generate the first data.
 6. The method of claim 1, further comprising transmitting an abstract syntax tree (AST) of the data processing pipeline to an intake system, wherein the intake system produces an augmented AST by causing a function of the graph that writes to an external database to drop received data instead of writing the received data to the external database and by adding a preview node to the graph in association with the machine learning model.
 7. The method of claim 1, further comprising transmitting an abstract syntax tree (AST) of the data processing pipeline to an intake system, wherein the intake system produces an augmented AST by causing a function of the graph that writes to an external database to drop received data instead of writing the received data to the external database and by adding a preview node to the graph in association with the machine learning model, and wherein the intake system runs a job using the augmented AST that results in the first data being transmitted to the preview node.
 8. The method of claim 1, further comprising transmitting an abstract syntax tree (AST) of the data processing pipeline to an intake system, wherein the intake system produces an augmented AST by causing a function of the graph that writes to an external database to drop received data instead of writing the received data to the external database and by adding a preview node to the graph in association with the machine learning model, wherein the intake system runs a job using the augmented AST that results in the first data being transmitted to the preview node, and wherein applying the first data as an input to the machine learning model to generate output data further comprises applying, by the preview node, the first data as an input to the machine learning model to generate output data.
 9. The method of claim 1, wherein the first data comprises a stream of data items generated by the component of the data processing pipeline in sequence, and wherein applying the first data as an input to the machine learning model further comprises applying, in sequence, each of the data items of the stream of data items as an input to the machine learning model to generate the output data.
 10. The method of claim 1, wherein applying the first data as an input to the machine learning model to generate output data further comprises applying the first data as the input to the machine learning model for a first period of time.
 11. The method of claim 1, wherein applying the first data as an input to the machine learning model to generate output data further comprises applying the first data as the input to the machine learning model for a first period of time, and wherein the first data corresponds to a second period of time.
 12. The method of claim 1, wherein applying the first data as an input to the machine learning model to generate output data further comprises applying the first data as the input to the machine learning model for a first period of time, and wherein the first data corresponds to a second period of time greater than the first period of time.
 13. The method of claim 1, wherein the first data comprises a stream of data items generated by the component of the data processing pipeline in sequence, wherein applying the first data as an input to the machine learning model to generate output data further comprises: for each data item of the stream of data items in sequence, applying the respective data item as an input to the machine learning model to generate a portion of the output data; and determining, a first period of time after an initial portion of the output data is generated, that no portion of the output data corresponds to a first type of label.
 14. The method of claim 1, wherein the first data comprises a stream of data items generated by the component of the data processing pipeline in sequence, wherein applying the first data as an input to the machine learning model to generate output data further comprises: for each data item of the stream of data items in sequence, applying the respective data item as an input to the machine learning model to generate a portion of the output data; determining, a first period of time after an initial portion of the output data is generated, that no portion of the output data corresponds to a first type of label; and stopping application of the stream of data items as an input to the machine learning model.
 15. The method of claim 1, wherein the sampling of the one or more different types of data output by the machine learning model comprises a sampling of one or more different label types found in the output data.
 16. A system, comprising: one or more data stores including computer-executable instructions; and one or more processors configured to execute the computer-executable instructions, wherein execution of the computer-executable instructions causes the system to: provide a user interface depicting a graph representing a data processing pipeline; receive, via the user interface, a request to activate a preview mode in association with a machine learning model; obtain first data generated by a component of the data processing pipeline interconnected with the machine learning model; apply the first data as an input to the machine learning model to generate output data; and cause the user interface to display a preview of a portion of the output data that comprises a sampling of one or more different types of data output by the machine learning model.
 17. The system of claim 16, wherein execution of the computer-executable instructions further causes the system to cause the user interface to display the preview without writing the output data to at least one destination specified by the graph.
 18. The system of claim 16, wherein execution of the computer-executable instructions further causes the system to transmit an abstract syntax tree (AST) of the data processing pipeline to an intake system, wherein the intake system produces an augmented AST by causing a function of the graph that writes to an external database to drop received data instead of writing the received data to the external database and by adding a preview node to the graph in association with the machine learning model, wherein the intake system runs a job using the augmented AST that results in the first data being transmitted to the preview node, and wherein execution of the computer-executable instructions further causes the system to apply, by the preview node, the first data as an input to the machine learning model to generate output data.
 19. A non-transitory computer-readable medium comprising instructions executable by a computing system that, when executed, cause the computing system to: provide a user interface depicting a graph representing a data processing pipeline; receive, via the user interface, a request to activate a preview mode in association with a machine learning model; obtain first data generated by a component of the data processing pipeline interconnected with the machine learning model; apply the first data as an input to the machine learning model to generate output data; and cause the user interface to display a preview of a portion of the output data that comprises a sampling of one or more different types of data output by the machine learning model.
 20. The non-transitory computer-readable medium of claim 19, wherein the instructions, when executed, further cause the computing system to cause the user interface to display the preview without writing the output data to at least one destination specified by the graph.
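
By way of non-limiting illustration only, and without limiting the scope of the claims, the graph augmentation and label sampling recited above might be sketched as follows. The graph is simplified to a list of nodes and edges rather than a full abstract syntax tree, and an item-count budget stands in for the recited period of time; the names `Node`, `augment_for_preview`, and `sample_by_label` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # assumed kinds: "source", "transform", "model", "sink", "drop", "preview"


def augment_for_preview(
    nodes: list[Node], edges: list[tuple[str, str]], model_name: str
) -> tuple[list[Node], list[tuple[str, str]]]:
    """Return a copy of the graph in which sinks drop data and a preview node is added."""
    # Sinks that would write to an external destination are turned into drop nodes.
    new_nodes = [replace(n, kind="drop") if n.kind == "sink" else n for n in nodes]
    preview = Node(f"preview:{model_name}", "preview")
    # The preview node receives the same data that feeds the model (the "first data").
    preview_edges = [(src, preview.name) for src, dst in edges if dst == model_name]
    return new_nodes + [preview], edges + preview_edges


def sample_by_label(
    items: Iterable,
    model: Callable[[object], str],
    per_label: int = 2,
    max_items: int = 1000,
) -> dict[str, list]:
    """Apply the model to items in sequence and keep a few examples of each label."""
    samples: dict[str, list] = {}
    for count, item in enumerate(items):
        if count >= max_items:  # budget exhausted: stop applying the stream
            break
        bucket = samples.setdefault(model(item), [])
        if len(bucket) < per_label:
            bucket.append(item)
    return samples


if __name__ == "__main__":
    nodes = [Node("src", "source"), Node("parse", "transform"),
             Node("clf", "model"), Node("db", "sink")]
    edges = [("src", "parse"), ("parse", "clf"), ("clf", "db")]
    print(augment_for_preview(nodes, edges, "clf"))
    print(sample_by_label(range(10), lambda x: "even" if x % 2 == 0 else "odd"))
```

A label that never appears within the budget is simply absent from the returned dictionary, which a caller could use to report that no output of that type was observed during the preview.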